AD-A167 782  THE COMPUTATIONAL COMPLEXITY OF TWO-LEVEL MORPHOLOGY
(U) MASSACHUSETTS INST OF TECH CAMBRIDGE ARTIFICIAL
INTELLIGENCE LAB  G E BARTON  NOV 85  AI-M-856
UNCLASSIFIED  N00014-80-C-0505  F/G 5/7




REPORT DOCUMENTATION PAGE


READ INSTRUCTIONS
BEFORE COMPLETING FORM

1. REPORT NUMBER    2. GOVT ACCESSION NO.    3. RECIPIENT'S CATALOG NUMBER


AIM-856 


4. TITLE (and Subtitle)


5. TYPE OF REPORT & PERIOD COVERED


The Computational Complexity of
Two-Level  Morphology 


AI-Memo 


6. PERFORMING ORG. REPORT NUMBER


7. AUTHOR(s)    8. CONTRACT OR GRANT NUMBER(s)


G.  Edward  Barton,  Jr. 


N00014-80-C-0505 


9. PERFORMING ORGANIZATION NAME AND ADDRESS


Artificial  Intelligence  Laboratory 
545  Technology  Square 
Cambridge,  MA  02139 

11. CONTROLLING OFFICE NAME AND ADDRESS

Advanced Research Projects Agency
1400 Wilson Blvd.

Arlington, VA 22209


14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)

Office  of  Naval  Research 
Information  Systems 
Arlington,  VA  22217 


10. PROGRAM ELEMENT, PROJECT, TASK
    AREA & WORK UNIT NUMBERS


12. REPORT DATE


15. SECURITY CLASS. (of this report)


UNCLASSIFIED


16. DISTRIBUTION STATEMENT (of this Report)


Distribution is unlimited.


17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)


19. KEY WORDS (Continue on reverse side if necessary and identify by block number)


Finite-State  machines 
Morphological  analysis 
Natural  language 


KIMMO  system 
Computational  complexity 
NP-completeness


20. ABSTRACT (Continue on reverse side if necessary and identify by block number)


Morphological  analysis  requires  knowledge  of  the  stems,  affixes, 
combinatory  patterns,  and  spelling-change  processes  of  a  language. 

The computational difficulty of the task can be clarified by investigating
the computational characteristics of specific models of morphological
processing. The use of finite-state machinery in the "two-level"
model by Kimmo Koskenniemi gives it the appearance of computational
efficiency, but closer examination shows the model does not guarantee***




***Block  20,  continued: 


efficient  processing.  Reductions  of  the  satisfiability  problem 
show  that  finding  the  proper  lexical-surface  correspondence  in  a 
two-level  generation  or  recognition  problem  can  be  computationally 
difficult.  However,  another  source  of  complexity  in  the  existing 
algorithms  can  be  sharply  reduced  by  changing  the  implementation 
of  the  dictionary  component.  A  merged  dictionary  with  bit-vectors 
reduces  the  number  of  choices  among  alternative  dictionary  subdivisions 
by  allowing  several  subdivisions  to  be  searched  at  once. 




MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 
ARTIFICIAL  INTELLIGENCE  LABORATORY 


A.I.  Memo  No.  856 


November, 1985


THE  COMPUTATIONAL  COMPLEXITY 
OF  TWO-LEVEL  MORPHOLOGY 

G.  Edward  Barton,  Jr. 



ABSTRACT: 

Morphological  analysis  requires  knowledge  of  the  stems,  affixes,  combinatory  patterns, 
and spelling-change processes of a language. The computational difficulty of the task can be
clarified by investigating the computational characteristics of specific models of morphological
processing. The use of finite-state machinery in the "two-level" model by Kimmo Koskenniemi
gives it the appearance of computational efficiency, but closer examination shows the
model does not guarantee efficient processing. Reductions of the satisfiability problem show
that finding the proper lexical-surface correspondence in a two-level generation or recognition
problem can be computationally difficult. However, another source of complexity in the
existing algorithms can be sharply reduced by changing the implementation of the dictionary
component. A merged dictionary with bit-vectors reduces the number of choices among
alternative  dictionary  subdivisions  by  allowing  several  subdivisions  to  be  searched  at  once. 


This  report  describes  research  done  at  the  Artificial  Intelligence  Laboratory  of  the  Massachusetts 
Institute of Technology. Support for the Laboratory's artificial intelligence research has been provided
in part by the Advanced Research Projects Agency of the Department of Defense under Office of
Naval Research contract N00014-80-C-0505. A version of this paper was presented to the Workshop
on Finite-State Morphology, Center for the Study of Language and Information, Stanford University,
July 29-30, 1985; the author is grateful to Lauri Karttunen for making that presentation possible.
Bob Berwick has provided useful guidance and commentary in this research, and the paper has also
been improved in response to suggestions from Bonnie Dorr and Eric Grimson.


©Massachusetts Institute of Technology, 1985


1.  Introduction 


The “dictionary lookup” stage in a sophisticated natural-language system can involve
much more than simple information retrieval. In text, the words that the system knows may
show up in heavily disguised form. Inflectional endings such as tense and plural markings may
be present; the addition of prefixes and suffixes may change part-of-speech and meaning in
systematic ways; in many languages words may have unrelated clitics attached. The addition
of prefixes, suffixes, and endings is often accompanied by spelling changes as well; in English,
try+s becomes tries and dig+er becomes digger. The rules of spelling change can be rather
complex.

Superficially, it seems that word recognition might potentially be complicated and difficult.
This paper examines the question more formally by investigating the computational
characteristics of the “two-level” model of morphological processes (§2). Given the kinds of
constraints that can be encoded in the model, how difficult can it be to translate between
lexical and surface forms? Although the use of finite-state machinery in the two-level model
gives it the appearance of computational efficiency, the model itself does not guarantee
efficient processing. Taking the KIMMO system (Karttunen, 1983) for concreteness, sections 4
and 6 will show that the general problem of mapping between lexical and surface forms in
two-level systems is computationally difficult in the worst case. If null characters are excluded,
the problem is NP-complete. If null characters are completely unrestricted, the problem is
PSPACE-complete and thus probably even harder in the worst case. The fundamental
difficulty of the problems does not seem to be a precompilation effect (§5).


1.1. Morphological analysis


The  word-level  processing  carried  out  by  a  natural-language  system  is  formally  a  type  of 
morphological analysis, concerned with recovering the internal structures of input words. For
example, singing can be recognized as an inflected form of the verb sing, while unhappy
can be analyzed as un+happy. However, the morphological component cannot break words up
blindly; despite appearances, duckling is not the -ing form of a verb. The morphological
analyzer must know the basic words of the language in addition to the prefixes and suffixes. In
fact, analysis must be guided by more specific constraints as well. Not every word can combine
with every affix; it would be an error to analyze unit as un+it or beer as be+er (compare
doer). 
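
The role of the lexicon and of co-occurrence restrictions can be illustrated with a minimal Python sketch. The tiny word lists and the `analyze` function below are invented for illustration and do not come from any system discussed in this paper; spelling changes are ignored entirely.

```python
# Hypothetical toy lexicon: each stem lists the affixes it may combine with.
STEMS = {
    "sing": {"+ing", "+er"},
    "happy": {"un+"},
    "do": {"+er", "+ing", "un+"},
    "duckling": set(),
    "unit": set(),
    "beer": set(),
    "be": {"+ing"},
    "it": set(),
}
PREFIXES = ["un"]
SUFFIXES = ["ing", "er"]

def analyze(word):
    """Return all analyses of `word` licensed by the toy lexicon."""
    analyses = []
    if word in STEMS:
        analyses.append(word)
    for p in PREFIXES:
        rest = word[len(p):]
        # A prefix analysis needs a known stem that explicitly allows the prefix.
        if word.startswith(p) and rest in STEMS and p + "+" in STEMS[rest]:
            analyses.append(p + "+" + rest)
    for s in SUFFIXES:
        stem = word[:-len(s)]
        # Likewise for suffixes: "duckl" is not a stem, and "be" does not take +er.
        if word.endswith(s) and stem in STEMS and "+" + s in STEMS[stem]:
            analyses.append(stem + "+" + s)
    return analyses
```

With this toy data, `analyze("unit")` and `analyze("beer")` return only the unanalyzed words, because the stems `it` and `be` do not list `un+` or `+er` among their permitted affixes.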

The  number  of  inflected  forms  of  a  given  word  is  smaller  in  English  than  in  many  other 
languages.  As  a  result,  for  a  system  with  small  scope  it  often  suffices  to  trivialize  morphological 
analysis by listing all inflected forms in the dictionary directly. The trivial approach is not
feasible for heavily inflected languages such as Finnish, in which a word can have thousands
of possible forms. In such cases, both practicality and elegance require a more systematic
treatment  in  terms  of  inflectional  endings,  mood  and  tense  markers,  clitics,  and  so  forth. 

The  problem  of  recovering  the  internal  structures  of  words  can  take  an  extreme  form 
in languages that allow productive compounding. Kay and Kaplan (1982) illustrate such a
situation with the German word Lebensversicherungsgesellschaftsangestellter, which
means life insurance company employee. An exhaustive dictionary is impractical when such
free  compounding  is  possible. 



1.2.  Spelling  changes 

Besides knowing the stems, affixes, and co-occurrence restrictions of a language, a successful
morphological analyzer must take into account the spelling changes that often accompany
the addition of suffixes and similar elements.^1 The program must expect love+ing to appear
as loving, fly+s as flies, lie+ing as lying, and big+er as bigger. Its knowledge must be
sufficiently sophisticated to distinguish such surface forms as hopped (= hop+ed) and hoped
(= hope+ed). Cross-linguistically, spelling-change processes may span either a limited or a
more extended range of characters (§1.2.1), and the material that triggers a change may occur
either before or after the character that is affected (§1.2.2). Complex copying processes (§1.2.4)
may be found in addition to simpler, more specific changes.

1.2.1. Local and long-distance processes

The spelling changes associated with the addition of English suffixes are local in the sense
that they do not affect letters far away from the word-suffix boundary. However, there are
processes in other languages that operate over longer distances. The spelling of Turkish suffixes
is systematically affected by vowel harmony processes, which require the vowels in a word to
agree in certain respects.^2 The vowels that appear in a typical suffix are not completely
determined by the suffix, but are determined in part by the rules of vowel harmony. The suffix
that Underhill (1976) writes as -sInIz may appear in an actual word as -siniz, -sunuz,
-sünüz, or -sınız depending on the preceding vowel. Turkish words may contain large numbers
of suffixes, and the effects of vowel harmony can propagate for long distances. (Hungarian
suffixes  display  similar  changes.) 
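
As a rough illustration of how harmony propagates, the following Python sketch realizes each abstract high vowel I according to the nearest vowel to its left. The `HARMONY` table encodes the usual four-way Turkish pattern; the function name `attach` and the example words are assumptions added here, not taken from Underhill or from the KIMMO system.

```python
# Four-way harmony for the abstract high vowel "I" (usual Turkish pattern).
HARMONY = {"a": "ı", "ı": "ı", "e": "i", "i": "i",
           "o": "u", "u": "u", "ö": "ü", "ü": "ü"}

def attach(stem, suffix):
    """Attach `suffix`, realizing each abstract 'I' by the nearest vowel to its left."""
    out = list(stem)
    last_vowel = next((c for c in reversed(stem) if c in HARMONY), None)
    for ch in suffix:
        if ch == "I":
            if last_vowel is None:
                raise ValueError("no vowel to harmonize with")
            ch = HARMONY[last_vowel]
        if ch in HARMONY:
            last_vowel = ch   # the realized vowel conditions the next one
        out.append(ch)
    return "".join(out)
```

Because each realized vowel becomes the trigger for the next, feeding the output of one `attach` call into another lets the effect of a root vowel travel across any number of suffixes.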

1.2.2.  Left  and  right  context 

Local spelling changes often depend on right context as well as left context; for instance,
carry+ed changes y to i but carry+ing retains y. Less commonly, long-distance changes can
also be triggered by material to the right.^3 Verb stems in the Australian language Warlpiri
display a regressive change of i to u triggered by a tense suffix containing a nasal u; thus the
imperative form of throw is kiji-ka, but the past-tense form is kuju-rnu (Nash, 1980:84).
As illustrated, this harmony process can affect more than one i in the verb stem. It can also
propagate through the element -mi that can appear between the verb stem and the tense

^1 Spelling-change processes actually represent a superficial amalgam of phonological changes and orthographic
conventions. In this paper, these two aspects of spelling changes will not be distinguished. The
phonology and the orthography of a language do not have the same status for linguistics, but the differences
are not relevant for present purposes. Note also that it is the surface spelling of a word that will be presented
to a program that analyzes written text.

^2 For details of this process, see Underhill (1976), Clements and Sezer (1982), and numerous references cited
therein.

^3 Many current analyses of vowel harmony take it to be a fundamentally nondirectional process, even in
languages in which it always appears to operate from left to right. For example, it appears as though the
influence of root vowels on affix vowels always proceeds from left to right in Turkish, but this is because
Turkish lacks prefixes. Clements and Sezer (1982:240ff) discuss a process of colloquial Turkish in which a
vowel is inserted between the initial letters of certain words. The choice of vowel is determined by the usual
harmony rules of Turkish, but operating from right to left in this case. See also Poser (1982).


ending. (Warlpiri also has another long-distance harmony process, which operates from right
to  left.) 

Other languages provide further examples of long-distance changes that are conditioned
by material to the right. Kay and Kaplan (1982) mention a vowel-change process in Icelandic
that causes vowels in the middle of a word to depend on the vowels in a following suffix. The
inflectional system of German also involves vowel changes. Poser (1982:131ff) discusses an
extreme example of long-distance right-to-left harmony that occurs in the language Chumash.
The process that he describes changes s to š throughout the entire word when an š occurs in
a suffix; thus s+lu+sisin+waš (3+all+grow awry+past) becomes šlušišinwaš (it is all grown
awry).

1.2.3.  Right  context  and  processing  ambiguity 

The  existence  of  changes  that  depend  on  right  context  implies  that  the  lexical-surface 
correspondence  for  a  particular  character  cannot  always  be  determined  when  the  character  is 
first seen in a left-to-right scan. However, right context is not crucial for the occurrence of
this  difficulty.  The  same  kind  of  local  ambiguity  can  arise  even  when  spelling  changes  do  not 
depend  on  right  context. 

Suppose we were to remove the dependence of the y-to-i change on right context by considering
a rule system in which y always changes to i after p.^4 There could still be uncertainty
about how analysis should proceed. A surface string beginning spi... could correspond to a
lexical string spy... as in spies, but it could equally well correspond to spi... as in spider
or spiel. In general, analysis may proceed several characters beyond a choice point before it
becomes apparent which choice is correct. This is especially true with a large system vocabulary;
in the above example, a system that did not know any spi... words could immediately
rule out spi... in favor of spy..., but a system with more complete coverage would have to
look  further  into  the  input  before  it  could  identify  the  correct  choice. 
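
The dependence of this ambiguity on vocabulary size can be sketched in a few lines of Python. The sketch assumes the constructed rule above (lexical y always surfaces as i after p) and a vocabulary given as a plain list of lexical stems; the function names are hypothetical.

```python
from itertools import product

def lexical_candidates(surface):
    """All lexical strings that could surface as `surface` under the toy
    rule 'lexical y always surfaces as i after p'."""
    options = []
    for k, ch in enumerate(surface):
        after_p = k > 0 and surface[k - 1] == "p"
        if ch == "i" and after_p:
            options.append(["i", "y"])   # surface i may be lexical i or y
        elif ch == "y" and after_p:
            options.append([])           # surface py cannot arise under the rule
        else:
            options.append([ch])
    return ["".join(t) for t in product(*options)]

def viable(surface_prefix, vocabulary):
    """Lexical candidates that are still a prefix of some known word."""
    return [lex for lex in lexical_candidates(surface_prefix)
            if any(word.startswith(lex) for word in vocabulary)]
```

With a vocabulary containing only `spy`, the prefix `spi` has one viable reading; once `spider` and `spiel` are added, both readings survive until a later character is seen.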

1.2.4.  Reduplication 

Some  languages  display  a  kind  of  change  called  reduplication  that  often  does  not  lend 
itself to analysis by the kinds of mechanisms that are appropriate for the other processes that
have been mentioned here. Reduplication processes involve the copying of consonants, vowels,
syllables, roots, or other subunits of words. Nash (1980:136ff) describes a reduplication process
in Warlpiri that copies the first two syllables of a verb and has various semantic effects. For
example,  he  cites  the  sentence 

pirli ka parnta-parnta-rri-nja-mpa ya-ni

hill  PRES  crouch-REDUP  INF-across  go-NONPAST 

The  mountain  extends  in  a  series  of  humps. 

^4 If y always changes to i after p, what justification could there be for saying that spy and not spi is the
correct underlying form? In this trivial constructed example, there is none. In an actual language, there could
be evidence from a variety of sources: suffixes beginning with y; harmony processes; rules that create or destroy
the p that triggers the change; rules that are triggered by the y before it changes; and so forth.




in which the verb stem parntarri- has undergone reduplication.^5 Lieber's (1980:234ff) discussion
of several reduplication processes in the language Tagalog provides other examples.
One Tagalog reduplication process copies the first consonant and vowel of the stem, making
the copied vowel short; another is similar, but makes the copied vowel long; a third process
copies the first syllable and part or all of the second, lengthening the copied vowel of the second
syllable. See also McCarthy's (1982:193f) treatment of reduplication in Classical Arabic.^6


^5 The hyphens in the Warlpiri examples are inserted as an analytical aid for the reader, and do not conform
to the standard orthography (Hale, 1982:323).

^6 McCarthy's treatment of Arabic is of theoretical interest for at least two reasons: it helps illuminate the
nature of linguistic representations, and it shows a way to derive many characteristics of Arabic reduplication
from universal linguistic principles rather than language-particular stipulations.




2.  Two-Level  Morphology 


Given a description of the root forms, the combinatory patterns, and the spelling-change
rules of a language, the morphological analysis task is well-defined in an abstract sense. However,
a practical morphological analyzer also needs an efficient way of putting its linguistic
knowledge to use in actual processing. The KIMMO system described by Karttunen (1983)
is attractive for this purpose. KIMMO is an implementation of the “two-level” model of morphological
analysis that Kimmo Koskenniemi proposed and developed in his Ph.D. thesis.^7
Spelling-change rules are encoded in a finite-state automaton component, while roots and affixes
are listed with their co-occurrence restrictions in a dictionary component. The focus
here is on the automaton component. (Reduplication processes find no easy treatment in the
KIMMO system, and will henceforth be ignored.)

2.1.  The  Automaton  Component 

The  two-level  model  is  concerned  with  the  representation  of  a  word  at  two  distinct  levels, 
the lexical or dictionary level and the surface level. At the surface level, words are represented
as they might show up in text. At the lexical level, words consist of sequences of stems,
affixes, diacritics, and boundary markers that have been pasted together without spelling
changes. Thus Karttunen and Wittenburg (1983) represent the surface form tries as try+s
at the lexical level. Similarly, the Warlpiri surface form kijika might be represented at the
lexical level as kIjI-ka, where I is a special lexical character that can surface as either i or
u according to harmony rules.

2.1.1. Expressing Spelling Changes as Two-Level Automata

A spelling-change rule in the two-level model is expressed as a constraint on the correspondence
between lexical and surface strings. For example, consider a simplified “Y-Change”
process that changes y to i before adding es. Y-Change can be expressed in the two-level
model as a constraint on the appearance of the lexical-surface pairs y/y and y/i. Lexical y
must correspond to surface i rather than surface y when it occurs before lexical +s, which will
itself come out as surface es due to the operation of other constraints.

Each constraint is encoded as a finite-state machine with two scanning heads that move
along the lexical and surface strings in parallel. The machine starts out in state 1, and at each
step of its operation, it changes state based on its current state and the pair of characters it
is scanning. The automaton that encodes the Y-Change constraint would be described by the


^7 University of Helsinki, Finland, circa Fall 1983.


following state table:

"Y-Change" 5 5
           y  y  +  s  =   (lexical characters)
           i  y  =  s  =   (surface characters)
state 1:   2  4  1  1  1   (normal state)
state 2.   0  0  3  0  0   (require +s)
state 3.   0  0  0  1  0   (require s)
state 4:   2  4  5  1  1   (forbid +s)
state 5:   2  4  1  0  1   (forbid s)

In this notation, taken from Karttunen (1983) following Koskenniemi, = is a certain kind of
wildcard character. The use of : rather than . after the state-number on some lines indicates
that the : states are final states, which will accept end-of-input. In order to handle insertion or
deletion, it is also possible to have a null character 0 on one side of a pair,^8 but the possibility
of nulls will not be given full consideration until section 6.

In processing the lexical-surface string pair try+s/tries, the automaton would run
through the state sequence 1,1,1,2,3,1 and accept the correspondence. In contrast, with the
string pair try+s/tryes it would block on s/s after the state sequence 1,1,1,4,5 because the
entry for s/s in state 5 is zero. With the pair try/tri it would not block with any zero
entries, but would still reject the pair because it would end up in state 2, which is designated
as non-final.
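
The three runs just described can be reproduced with a short Python sketch of the Y-Change state table. Two simplifications are assumptions made here rather than properties of KIMMO: the wildcard = is taken to match any character, and a pair is assigned to the first column that matches it. Null characters are not handled.

```python
# Columns of the Y-Change table as (lexical, surface) pairs; "=" is the wildcard.
COLUMNS = [("y", "i"), ("y", "y"), ("+", "="), ("s", "s"), ("=", "=")]
TABLE = {
    1: [2, 4, 1, 1, 1],   # normal state
    2: [0, 0, 3, 0, 0],   # require +s
    3: [0, 0, 0, 1, 0],   # require s
    4: [2, 4, 5, 1, 1],   # forbid +s
    5: [2, 4, 1, 0, 1],   # forbid s
}
FINAL = {1, 4, 5}         # states written with ":" in the table

def column_for(lex, surf):
    """Pick the first column matching the pair (assumed matching policy)."""
    for idx, (l, s) in enumerate(COLUMNS):
        if l in (lex, "=") and s in (surf, "="):
            return idx
    raise ValueError("no matching column")

def run(lexical, surface):
    """Return (accepted, state_sequence) for two equal-length strings."""
    state, seq = 1, [1]
    for lex, surf in zip(lexical, surface):
        state = TABLE[state][column_for(lex, surf)]
        if state == 0:            # a zero entry blocks the automaton
            return False, seq
        seq.append(state)
    return state in FINAL, seq
```

The returned sequence includes the start state, which matches the way the state sequences are written in the text.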

These examples illustrate how the Y-Change automaton implements dependence on the
right context +s. The automaton will accept either of the correspondences y/i and y/y, but
if it processes the y/i correspondence, it will enter a sequence of states that will ultimately
block unless the y/i pair is followed by the appropriate lexical context +s. The right context
for a vowel harmony process might seem more difficult to encode because it may be necessary
to ignore several intervening consonants, but such a situation actually presents no problem at
all. An automaton state can easily ignore irrelevant characters by looping back to itself.

2.1.2.  Multiple  Spelling-Change  Processes 

A  language  will  generally  exhibit  several  different  spelling-change  processes;  for  example, 
Karttunen  (1983:177)  mentions  that  Koskenniemi’s  analysis  of  Finnish  uses  21  rules.  By  and 
large, these separate processes can be encoded as separate automata in the KIMMO system.
In actual processing, the automata that express various spelling-change constraints will all
inspect the lexical-surface correspondence in parallel. The correspondence will be accepted
only if every automaton accepts it, that is, if it satisfies every constraint.^9 Because the
automata are connected in parallel rather than in series, there are no “feeding” relationships
between two-level automata.^10 Figure 1 illustrates the parallel arrangement of the KIMMO

^8 The actual KIMMO system of Karttunen (1983) does not allow null characters at the lexical level, but the
omission is incidental (Karttunen, p.c.).

^9 If null characters are allowed, the interpretation of “satisfying every constraint” takes on a certain subtlety.
See section 6.

^10 It is a theoretical claim of the two-level framework that intermediate levels of representation and “feeding”
relationships are not necessary; two levels suffice, in other words. Series connection of the automata


    t r y + s

    t r i e s


Figure 1: The automaton component of the KIMMO system consists of several two-headed finite-state
automata that inspect the lexical-surface correspondence in parallel. Each automaton imposes some
constraint on the correspondence. The automata move together from left to right. (From Karttunen,
1983:176.)


automata. A set of several automata can also be compiled into a single large automaton that
will run faster than the original set, though its size may be prohibitive (:176f).
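
The parallel arrangement amounts to a conjunction of constraints, which can be sketched as follows. The two toy automata are invented for illustration and are much cruder than real KIMMO rules; each is represented as a step function plus a set of final states, a representation assumed here for brevity.

```python
def run_parallel(automata, lexical, surface):
    """Accept the pair of strings only if every automaton accepts it.

    Each automaton is a (step, finals) pair, where step(state, lex, surf)
    returns the next state, or 0 to block. All automata start in state 1.
    """
    states = [1] * len(automata)
    for lex, surf in zip(lexical, surface):
        states = [step(s, lex, surf) for (step, _), s in zip(automata, states)]
        if 0 in states:
            return False
    return all(s in finals for (_, finals), s in zip(automata, states))

def pairs_ok(state, lex, surf):
    # Toy constraint: characters match, except y may surface as i and + as e.
    return 1 if lex == surf or (lex, surf) in {("y", "i"), ("+", "e")} else 0

def y_change(state, lex, surf):
    # Toy constraint: y/i must be followed by lexical +, and y/y must not be.
    if state == 2 and lex != "+":
        return 0
    if state == 3 and lex == "+":
        return 0
    if (lex, surf) == ("y", "i"):
        return 2
    if (lex, surf) == ("y", "y"):
        return 3
    return 1

AUTOMATA = [(pairs_ok, {1}), (y_change, {1, 3})]
```

The list `states` plays the role of a single state of the compiled automaton: a compiled machine would enumerate such state tuples in advance, which is where the potentially prohibitive size comes from.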


2.2.  The  Dictionary  Component 


The dictionary component of the KIMMO system is divided into sections called lexicons,
which are all ultimately reachable from a distinguished root lexicon. In the dictionary-level
processing for words such as singing, KIMMO first locates the lexical form sing in the root
lexicon. The mechanism for indicating co-occurrence restrictions involves listing a set of continuation
lexicons for each entry, and in this case one possibility will be a lexicon that contains
+ing. In the actual operation of the KIMMO system, dictionary processing is efficiently interleaved
with the operation of the automata in such a way that the two components mutually
constrain  their  operations. 
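
A minimal Python sketch of continuation lexicons follows. The lexicon names, the entries, and the convention that an entry with no continuations ends a word are all assumptions made for illustration; the actual KIMMO dictionary format is not reproduced here.

```python
# Hypothetical toy dictionary: each lexicon maps an entry to its continuation lexicons.
LEXICONS = {
    "Root": {"sing": ["VerbSuffix", "End"], "duckling": ["End"]},
    "VerbSuffix": {"+ing": ["End"], "+s": ["End"]},
    "End": {"": []},
}

def lexical_forms(lexicon="Root"):
    """Enumerate every lexical string reachable from `lexicon`."""
    forms = []
    for entry, continuations in LEXICONS[lexicon].items():
        if not continuations:          # terminal entry: the word may stop here
            forms.append(entry)
        for nxt in continuations:
            forms.extend(entry + rest for rest in lexical_forms(nxt))
    return forms
```

Since `duckling` continues only to `End`, no form such as duckl+ing is ever produced, which is the co-occurrence restriction at work.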

The continuation-class mechanism that the KIMMO dictionary uses to encode co-occurrence
restrictions among roots and affixes has only finite-state power; each lexicon corresponds to a
state in a transition network. As many people have noticed (e.g. Karttunen, 1983:180; Karttunen
and Wittenburg, 1983:222f), such a design makes it difficult or impossible to express
some morphological constraints. In the future, the KIMMO dictionary component will almost


would imply the existence of intermediate representation levels at the interface between automata. Beyond the
question of computational efficiency, the theoretical claims of the two-level model will not be evaluated here.
Possible arguments against them could involve (a) rule orderings with depth > 1, (b) particular analyses in
which the availability of only two levels leads to redundancy in the automata, and (c) multi-part alternative
representations (e.g. from autosegmental theory) that allow a more illuminating description of various linguistic
processes. One possible argument for them could involve the multiplicity of possibilities for rule ordering in a
model with intermediate derivational steps.


certainly  be  redesigned. 

The automaton component rather than the dictionary component of the KIMMO system is
the  main  object  of  attention  here,  and  little  more  will  be  said  about  the  dictionary  component 
until  section  7.1. 

2.3.  Generation  and  Recognition 

A KIMMO system does not particularly lean toward either generating or recognizing the
words of a language. Since the machines of the automaton component just express constraints
on permissible lexical-surface correspondences, they can serve equally well to determine the
lexical form of a surface word (recognition) or to map a lexical stem with affixes into the
proper surface form (generation). The only major difference is whether the process is driven
by the surface or lexical form. However, the recognition algorithm is slightly more complicated
because it uses the lexicon as well as the automata to constrain the analysis of an input word.
(As Karttunen (1983:184) notes, it would require only a simple change to run the recognizer
without the constraints of the stem lexicon. Such a mode of operation would be useful for
stripping recognizable suffixes from unfamiliar roots.)
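
The symmetry between the two directions can be made concrete with a brute-force Python sketch in which one constraint serves both. The pair alphabet `PAIRS` and the simplified Y-Change predicate are assumptions chosen for illustration, and exhaustive enumeration stands in for the actual KIMMO search.

```python
from itertools import product

# Assumed toy alphabet of feasible lexical/surface pairs.
PAIRS = [("t", "t"), ("r", "r"), ("y", "y"), ("y", "i"),
         ("+", "e"), ("s", "s"), ("i", "i"), ("e", "e")]

def accepts(pairs):
    """Toy Y-Change constraint: y surfaces as i exactly when lexical + follows."""
    for k, (lex, surf) in enumerate(pairs):
        if lex == "y":
            plus_next = k + 1 < len(pairs) and pairs[k + 1][0] == "+"
            if (surf == "i") != plus_next:
                return False
    return True

def correspondences(known, side):
    """All accepted pair sequences whose `side` (0 = lexical, 1 = surface)
    spells out the string `known`."""
    choices = [[p for p in PAIRS if p[side] == ch] for ch in known]
    return [seq for seq in product(*choices) if accepts(seq)]

def generate(lexical):
    return ["".join(s for _, s in seq) for seq in correspondences(lexical, 0)]

def recognize(surface):
    return ["".join(l for l, _ in seq) for seq in correspondences(surface, 1)]
```

Here `generate("try+s")` yields a single surface form, while `recognize("tries")` yields several lexical candidates; in this sketch nothing plays the role of the dictionary, which is what would prune the spurious ones.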


3.  The  Seeds  of  Complexity 


The use of finite-state machinery gives the two-level model the appearance of computational
efficiency, but in the worst case a KIMMO generator or recognizer has a lot of work
to do. This section probes possible sources of complexity, while the next section will exploit
them in mathematical reductions that answer the question of how hard KIMMO generation
and recognition can be in the general case.

3.1.  The  Lure  of  the  Finite-State 

At first glance, the KIMMO system raises hopes of unfailing efficiency. Both recognition
and generation seem to be a matter of stepping finite-state machines through the input from
left to right, a process that takes only a quick array reference or so per character. Any
nondeterminism that might arise causes little initial concern, since methods of determinizing
finite-state machines are well-known. Lexical lookup can also be done quickly, character by
character, interleaved with the speedy left-to-right progress of the automata:

It  is  a  common  technique  to  represent  lexicons  as  letter  trees  because  it  minimizes 
the time spent on searching for the right entry. The recognizer only makes a single
left-to-right  pass  as  it  homes  in  on  its  target  in  the  lexicon.  (Karttunen,  1983:178) 
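
A letter tree of the kind mentioned in the quotation can be sketched in Python with nested dictionaries. The representation, including the use of the key `None` to mark the end of an entry, is an assumption for illustration rather than the data structure of any particular implementation.

```python
def build_letter_tree(words):
    """Build a nested-dict letter tree; the key None marks the end of an entry."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root

def lookup(tree, word):
    """Single left-to-right pass: returns 'entry', 'prefix', or 'dead'."""
    node = tree
    for ch in word:
        if ch not in node:
            return "dead"
        node = node[ch]
    return "entry" if None in node else "prefix"
```

The lookup itself costs one step per character; what the tree cannot decide on its own is which lexical character to try when a surface character has more than one possible source.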

The fundamental efficiency of finite-state machines promises to make the speed of KIMMO
processing  for  a  language  largely  independent  of  the  nature  of  the  constraints  that  the  automata 
encode: 

The  most  important  technical  feature  of  Koskenniemi's  and  our  implementation  of 
the  Two-level  model  is  that  morphological  rules  are  represented  in  the  processor  as 
automata, more specifically, as finite state transducers .... One important consequence
of compiling [the grammar rules into automata] is that the complexity of the
linguistic description of a language has no significant effect on the speed at which
the forms of that language can be recognized or generated. This is due to the fact
that finite state machines are very fast to operate because of their simplicity .... Although
Finnish, for example, is morphologically a much more complicated language
than English, there is no difference of the same magnitude in the processing times
for the two languages .... [This fact] has some psycholinguistic interest because of
the common sense observation that we talk about “simple” and “complex” languages
but not about “fast” and “slow” ones. (:166f)

In order for the automaton-based two-level model to be of psycholinguistic interest in this
way, it must be the model itself that wipes out processing difficulty, rather than some
accidental property of the constraints that the automata encode. In much the same vein,
Lindstedt (1984:171) remarks, following Koskenniemi, that “it is psycholinguistically interesting
to note that the [two-level] rules are equivalent to such computationally simple and effective
[i.e. efficient] devices,” again picking out the finite-state machinery as the factor responsible
for computational efficiency.


3.2.  Sample  Recognizer  Behavior 

In assessing the computational characteristics of the KIMMO processing algorithms, it is
logical to begin with an example. Figure 2 shows the operations that a KIMMO recognizer
for English goes through when it analyzes the word spiel. From inspecting the sequence of
lexical forms that are considered, it is clear that the recognizer does more than just glide
from left to right through the string.

For example, at step 7 the recognizer is considering the lexical string spy+, with y surfacing
as i and + as e, under the theory that the input word might be a plural form of the noun spy
(spies or spies', that is). At step 9 that analysis has failed to pan out and spy+ is considered
again, this time with + coming out null on the surface instead of matching the input e. At
step 11 the recognizer has dropped back to the form spy that it was considering at step 4, this
time taking the root as a verb. All of the spy possibilities ultimately fail, and at step 52 the
recognizer finally tries spi instead, repudiating the incorrect choice that it made in step 3. In
step 53 it assumes that the e in the lexical form spie... might have been deleted, but this
idea soon founders. Finally, in step 59 it finds the correct lexical entry spiel.

3.3.  Sources  of  Runtime  Complexity 

Traces of recognizer operation reveal several factors that combine to determine the overall
computational difficulty of an analysis. The recognizer must run the finite-state machines of
the automaton component and descend the letter trees that make up a lexicon, it must decide
which suffix lexicon to explore after finding a root, and it must discover the correct
lexical-surface correspondence.

3.3.1.  Stepping  through  the  automata  and  the  lexicon 

First of all, some of the recognizer's activities are concerned with the mechanical operation
of the automata and the letter trees of the lexicon. Running the automata is expected to
be fast; there are many well-known fast implementations of finite-state machines, differing
somewhat in their time and space requirements. Descending a letter tree should also be easy,
in any of its common implementations.
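As a rough illustration of why descending a letter tree is cheap, the following Python sketch stores a toy lexicon as nested dictionaries and follows one character per step. The word list and the names build_letter_tree and descend are assumptions made for this example; they are not taken from any KIMMO implementation.

```python
def build_letter_tree(words):
    """Build a nested-dict letter tree; the key None marks the end of an entry."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True  # an entry of the lexicon ends here
    return root


def descend(tree, prefix):
    """Follow prefix one character at a time; return the node reached, or None."""
    node = tree
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return None  # no lexicon entry starts with this prefix
    return node


# Hypothetical toy lexicon built from words discussed in this section.
lexicon = build_letter_tree(["spy", "spiel", "rub", "rubbish"])
after_sp = descend(lexicon, "sp")
print(sorted(ch for ch in after_sp if ch is not None))  # letters that may follow "sp"
```

Each step costs one dictionary lookup, so the mechanical cost of lookup grows only with the length of the word; the expensive part discussed below is deciding which branch to follow.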

3.3.2.  Choosing  among  alternative  lexicons 

Second, the recognizer often makes unfortunate choices about the path that it should
follow through the collection of lexicons in the dictionary component. Quite a few nodes in
the search tree of Figure 2 represent choices among alternative lexicons (LLL). For example,
at step 11 the recognizer may search any of several lexicons next: the lexicon I that encodes
the fact that the present indicative of a verb may have no added ending, the lexicon AG that
contains the agentive ending +er, or one of several other lexicons that contain +ed and other
inflectional endings.

The search for a path through the suffix lexicons of the dictionary component can take
considerable time in the current KIMMO implementation. However, such wandering can be


[Figure 2 table and search tree: for each of the 61 steps in recognizing the surface form
"spiel", the table lists the step number, the choice point backtracked to (in parentheses),
the lexical string, and the automaton states; the tree depicts the same search.]

Key to tree nodes:
  ---  normal traversal
  LLL  new lexicon
  AAA  blocking by automata
  XXX  no lexical-surface pairs compatible with surface char and dictionary
  III  blocking by leftover input
  ***  analysis found

Result: (("spiel" (N SG)))

Figure 2: These traces show the steps that the KIMMO recognizer for English goes through while
analyzing the surface form spiel. Each line of the table on the left shows the lexical string and
automaton states at the end of a step. If some automaton blocked, the automaton states are replaced
by an XXX entry. An XXX entry with no automaton name indicates that the lexical string could not
be extended because the surface character and lexical letter tree together ruled out all feasible pairs.
After an XXX or *** entry, the recognizer backtracks and picks up from a previous choice point,
indicated by the parenthesized step number before the lexical string. The tree on the right depicts
the search graphically, reading from left to right and top to bottom with vertical bars linking the
choices at each choice point. The figures were generated with a KIMMO implementation written in an
augmented version of MACLISP based initially on Karttunen's (1983:182ff) algorithm description; the
dictionary and automaton components for English were taken from Karttunen and Wittenburg (1983)
with minor changes. This implementation searches depth-first as Karttunen's does, but explores the
alternatives at a given depth in a different order from Karttunen's.


[Figure 3 table and search tree: the 29 steps in recognizing the surface form "spiel" with the
merged dictionary, in the same format as Figure 2; the lexicons visited are the merged
lexicons (/N), (/V), and (C1).]

Result: (("spiel" (N SG)))

Figure 3: The dictionary modification that will be described in section 7.1 allows the
recognizer to make fewer choices among lexicons. These traces show the steps that the recognizer goes
through in the analysis of spiel when the merged dictionary is used; the number of lexicon-choice
nodes (LLL) is lower than in Figure 2. The names of the merged lexicons are written in parenthesized
form to indicate that each one actually represents a class of lexicons in the original dictionary
description. A + entry in the backtracking column indicates backtracking from an immediate failure
in the previous step, which does not require the full backtracking mechanism to be invoked.


sharply reduced by merging the lexicons in such a way that several lexicons can be searched
in parallel; section 7.1 will explain in detail. Meanwhile, taking this improvement for granted
will make it possible to sidestep the problem and focus on other processes. With the merged
dictionary, Figure 3 shows that the number of lexicon-choice alternatives in the search tree for
spiel is reduced from 8 to 2,* cutting the total number of steps from 61 to 29. (The choice
between spy-noun and spy-verb remains because it would be directly reflected in the output,
but the purely internal choices among the lexicons for different verbal endings are eliminated.)


3.3.3.  Finding  the  lexical-surface  correspondence

Finally, some of the backtracking results from local ambiguity in the construction of the
lexical-surface correspondence. Even if only one possibility is globally compatible with the
constraints imposed by the lexicon and the automata, there may not be enough evidence at
every point in processing to choose the correct lexical-surface pair; search behavior results.


* These figures count LLL nodes, excluding unambiguous choices.


[Figure 4 search tree: the tree of Figure 3 with each branch labeled by the lexical/surface
pair (such as s/s, p/p, and y/i) or the merged lexicon ((/N), (/V), (C1)) being hypothesized.]

Result: (("spiel" (N SG)))

Figure 4: This expanded version of the search tree from Figure 3 shows what hypothesis the KIMMO
recognizer is entertaining along each path, during the analysis of spiel with a merged dictionary.


Figure 4 displays the search graphically with an expanded version of the merged-lexicon search
tree from Figure 3, annotated with information about the specific choices the recognizer has
at each point.

Thus, after seeing the surface characters spi..., the recognizer did not have enough
evidence to choose between the lexical possibilities spy... and spi..., even though only
one analysis was possible for the complete input spiel. During exploration of the spy...
possibility in the (/V) lexicon, there was uncertainty about the pairs +/0, +/e, e/0, and
e/e. It proved unprofitable to explore those regions of the tree in the analysis of spiel, but
Figures 5 and 6 show that the correct analysis can lie in those regions for other words.

Similarly, in analyzing the word rubbish (Figure 7), the recognizer cannot tell after
seeing only rubb... whether the lexical string is rubb... as in rubbish or rub+... as in
rub+ing → rubbing. In fact, it briefly considers the possibility that surface r... might
correspond to lexical re'... as in the stress-marked lexical representation re'fer, but it
quickly discovers that the right context for licensing the e/0 pair is absent. (Recall from
section 2.1.1 how a KIMMO automaton implements a change that depends on right context:
initially it permits the changed pair in the expectation that the proper right context will be
found, and upon processing the changed pair, it enters a state-sequence that will eventually
block without the necessary right context.)

In these cases, misguided search subtrees did not get very deep, largely because the
relevant spelling-change processes were local in character. Long-distance harmony processes
are also possible (§1.2), and thus there can potentially be a long interval before the acceptability
of a lexical-surface pair is ultimately determined. For example, when vowel alternations within
a verb stem are conditioned by the occurrence of particular tense suffixes, it may be necessary


[Figure 5 search tree: the search for the surface form spied, with branches labeled as in Figure 4.]

Result: (("spy+ed" (V PAST PRT)) ("spy+ed" (V PAST)))

Figure 5: The search tree for spied is similar to the search tree for spiel (Figure 4), but the solution
lies in a different region of the tree. Neither part of the search can be eliminated, since either one may
contain the solution.


[Figure 6 search tree: the search for the surface form spies, with branches labeled as in Figure 4.]

Result: (("spy+s" (V PRES SG 3RD)) ("spy+s" (N PL)))

Figure 6: In the analysis of spies, the location of the solution in the search tree is different from its
location for spiel (Figure 4) or spied (Figure 5). Thus none of the three main regions of the tree
can be pruned from the search.


[Figure 7 table: the 22 steps in recognizing the surface form "rubbish", in the same format as
the table of Figure 2; blocked steps are marked XXX Elision and XXX Gemination.]

Result: (("rubbish" (N SG)))

Figure 7: While analyzing the surface form rubbish, the KIMMO recognizer is temporarily misled
(i) by the possibility that a lexical e might have been deleted at the surface and (ii) by the possibility
that the surface bb might have resulted from doubling of a single underlying b. However, in each case
the possibility fails to pan out. (Refer to Figure 2 for an explanation of the table format.)


to see the end of the word before making final decisions about the stem.* The possibility of
a long period of uncertainty forms the basis for the reductions in section 4.

3.4.  Search  and  Verification 

Setting aside until section 7.1 the problem of choosing among alternative lexicons, it is
easy to see that the use of finite-state machinery helps control only one of the two remaining
sources of complexity. Stepping the automata should be fast, but the finite-state framework
does not guarantee speed in the task of guessing the correct lexical-surface correspondence.
The search required to find the correspondence may predominate.

In fact, the KIMMO recognition and generation problems bear an ominous resemblance
to problems in the computational class NP. NP consists of the problems that can be solved
on a Nondeterministic Turing machine within Polynomial time. Informally, a problem in NP
has a solution that may be hard to guess (hence the use of nondeterministic machines) but is
easy to verify (in polynomial time):

[Informally,] we view [a nondeterministic algorithm] as being composed of two separate
stages, the first being a guessing stage and the second a checking stage ....
(Garey and Johnson, 1979:28)

It should be evident that a “polynomial time nondeterministic algorithm” is basically
a definitional device for capturing the notion of polynomial time verifiability, rather
than a realistic method for solving decision problems. (:29)

This difference in difficulty between guessing and verification seems to fit the KIMMO
framework: the finite-state two-level automata can verify a solution quickly, but it may still be hard
to guess the correct lexical-surface correspondence.

* Since long-distance right context is part of the problem, it has been suggested that KIMMO processing in
the problematic cases would be easier if carried out from right to left. However, the more common left context
would then cause difficulties, and what could be done about mixed rule systems in which both left and right
context play a role? In fact, the reductions in section 4 show that no simple fix will help in the general case.


It is not always apparent from local evidence how to construct a lexical-surface
correspondence that will satisfy the constraints imposed by a set of two-level automata; thus the
KIMMO algorithms contain the seeds of complexity. The next sections will exploit those seeds
in mathematical reductions that prove KIMMO recognition and generation are computationally
difficult in the worst case. The finite-state two-level framework itself does not guarantee
computational efficiency.




4.  The  Complexity  of  Two-Level  Morphology 

The reductions in this section show that two-level automata can describe computationally
difficult problems in a very natural way. It follows that the two-level framework itself cannot
guarantee computational efficiency. If the words of natural languages are easy to analyze,
the efficiency of processing must result from some additional property that natural languages
have, beyond those that are captured in the two-level model. Otherwise, computationally
difficult problems might turn up in the two-level automata for some natural language, just as
they do in the artificially constructed languages here. In fact, the reductions are abstractly
modeled on the KIMMO treatment of harmony processes and other long-distance dependencies
in natural languages (see §§3.3.3, 1.2).

4.1.  The  SAT  Problem 

The reductions involve versions of the Boolean satisfiability problem (SAT). An instance
of SAT consists of a Boolean formula in conjunctive normal form (CNF), and the question to
be answered is whether there is a way of assigning values (T,F) to the variables so that the
formula comes out true. Thus the formulas

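x

(x ∨ y)&(x ∨ y)   [negation overbars illegible in the source]

(x̄ ∨ y)&(ȳ ∨ z)&(ȳ ∨ z̄)&(x ∨ y ∨ z)

are satisfiable, while the formulas

x&x̄

(x ∨ y)&(x ∨ y)&x   [negation overbars illegible in the source]

(x ∨ y ∨ z)&(x ∨ y)&(x ∨ z)&(y ∨ z)&(y ∨ z)&(x ∨ y)   [negation overbars illegible in the source]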

are unsatisfiable. The SAT problem is NP-complete and thus computationally difficult. The
related problem 3SAT is a restricted case of SAT in which every disjunction must have exactly
three disjuncts. (This restricted form of CNF is known as 3CNF.) 3SAT is also NP-complete,
though 2SAT is not.*
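The contrast between checking a truth-assignment and finding one can be sketched in Python. In this illustrative encoding (an assumption of the example, not the paper's notation), a formula is a list of clauses and a literal is a variable name with an optional leading minus sign for negation. The function satisfies checks one assignment in time linear in the size of the formula, while brute_force_sat may have to try all 2^n assignments.

```python
from itertools import product


def satisfies(cnf, assignment):
    """Check one truth-assignment against a CNF formula (linear in formula size)."""
    return all(
        # A literal is true when the variable's value differs from "is negated".
        any(assignment[lit.lstrip("-")] != lit.startswith("-") for lit in clause)
        for clause in cnf
    )


def brute_force_sat(cnf):
    """Try every assignment; return a satisfying one, or None if there is none."""
    variables = sorted({lit.lstrip("-") for clause in cnf for lit in clause})
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if satisfies(cnf, assignment):
            return assignment
    return None


# "x and not x" has no satisfying assignment.
print(brute_force_sat([["x"], ["-x"]]))
# (not-x or y) & (not-y or z) & (not-y or not-z) & (x or y or z)
print(brute_force_sat([["-x", "y"], ["-y", "z"], ["-y", "-z"], ["x", "y", "z"]]))
```

The exhaustive loop is only a baseline for small formulas; it is not meant as a practical SAT procedure.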

4.2.  KIMMO  Generation  is  NP-Hard

It is easy to encode an arbitrary SAT problem as a KIMMO generation problem. The
general problem of mapping from lexical to surface forms in KIMMO systems is therefore
NP-hard, i.e. NP-complete or worse (see section 6). Formally, define a possible instance of the
computational problem KIMMO GENERATION as any pair ⟨A, σ⟩, where A is the automaton
component of a KIMMO system specified as in Gajek et al. (1983) and σ is a string over the
alphabet of the KIMMO system. An actual instance of KIMMO GENERATION will be any

† For more extensive theoretical discussions of efficient processability, see Berwick and Weinberg (1982),
Barton (1985b), and references cited therein.

* SAT was the first problem to be proved NP-complete (Cook's Theorem, 1971). The NP-completeness of
3SAT is also well-known. For details, see Garey and Johnson (1979) or any standard textbook.


"x-consistency"  3  3
      x  x  =      (lexical characters)
      T  F  =      (surface characters)
  1:  2  3  1      (x undecided)
  2:  2  0  2      (x true)
  3:  0  3  3      (x false)


Figure 8: The KIMMO generator system that encodes a SAT formula φ should include a consistency
automaton of this form for every variable x that occurs in φ. The consistency automaton constrains
the mapping from variables in the lexical string to truth-values in the surface string, ensuring that
whatever value is assigned to x in one occurrence must be assigned to x in every occurrence.
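The behavior of a consistency automaton can be sketched as a small table-driven simulation in Python. The transition table is transcribed from the figure; the function name run_consistency and the representation of a lexical-surface correspondence as a list of (lexical, surface) character pairs are assumptions made for this illustration.

```python
# Transition table from the consistency automaton; 0 means "block".
# Columns: pair var/T, pair var/F, any other pair (the "=" wildcard column).
CONSISTENCY = {1: (2, 3, 1), 2: (2, 0, 2), 3: (0, 3, 3)}


def run_consistency(var, pairs, table=CONSISTENCY):
    """Return True if the automaton for var accepts the sequence of pairs."""
    state = 1  # state 1: no value chosen yet for var
    for lexical, surface in pairs:
        if lexical == var and surface == "T":
            column = 0
        elif lexical == var and surface == "F":
            column = 1
        else:
            column = 2  # punctuation and other variables are ignored
        state = table[state][column]
        if state == 0:
            return False  # blocked: var received two different values
    return True  # all three states are final


# x is T in both occurrences, so the x-automaton does not block.
print(run_consistency("x", list(zip("-xy,xz", "-TF,TT"))))
# x is T and then F, so the x-automaton blocks.
print(run_consistency("x", list(zip("x,x", "T,F"))))
```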


"satisfaction"  3  4 


      =  =  -  ,      (lexical characters)
      T  F  -  ,      (surface characters)
  1.  2  1  3  0      (no true seen in this group)
  2:  2  2  2  1      (true seen in this group)
  3.  1  2  0  0      (-F counts as true)


Figure 9: The SAT generator system for any formula should include this satisfaction automaton, which
determines whether the truth values assigned to the variables cause the formula to come out true.
Since the formula is in CNF, the requirement is that the groups between commas must all contain
at least one true value. In state 1, no true value has been seen; F cycles, while T goes to state 2 to
wait for the comma that begins the next group. State 3 remembers a preceding minus sign so that
-F can count as true. Only state 2 is a final state because only state 2 indicates that a true value has
occurred.
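A minimal Python sketch of the satisfaction automaton follows, run over surface strings only. The transition rows are transcribed from the figure; the dictionary layout and the function name accepts are assumptions of this example.

```python
# Rows from the satisfaction automaton, keyed by surface character; 0 means "block".
SATISFACTION = {
    1: {"T": 2, "F": 1, "-": 3, ",": 0},  # no true value seen in this group
    2: {"T": 2, "F": 2, "-": 2, ",": 1},  # a true value has been seen in this group
    3: {"T": 1, "F": 2, "-": 0, ",": 0},  # just saw a minus sign: -F counts as true
}
FINAL_STATES = {2}


def accepts(surface):
    """Return True if every comma-separated group contains a true value."""
    state = 1
    for ch in surface:
        state = SATISFACTION[state][ch]
        if state == 0:
            return False  # a group ended without any true value
    return state in FINAL_STATES


print(accepts("-FF,-FT,-F-T,FFT"))  # every group has a true value
print(accepts("-TF,"))              # the first group is -T F: no true value
```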


possible instance ⟨A, σ⟩ such that for some σ′, the lexical-surface pair σ/σ′ satisfies the
constraints imposed by the automata in A. Thus ⟨A, σ⟩ is an instance of KIMMO GENERATION
if there is any surface string that can be generated from the lexical string σ according to the
automata. (As the problem is defined, an algorithm is not required to exhibit the surface
strings that can be generated, but only to say whether there are any.)

To encode a SAT problem φ as a pair ⟨A, σ⟩, first construct σ from the CNF formula φ
by a notational translation. Use a minus sign for negation, a comma for conjunction,
and no explicit operator for disjunction. Then the σ corresponding to the formula
(x̄ ∨ y)&(ȳ ∨ z)&(x ∨ y ∨ z) is -xy,-yz,xyz. The notation is unambiguous without
parentheses because φ is required to be in CNF.
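The notational translation can be sketched in a few lines of Python. The input representation (a list of clauses, each a list of (variable, negated) pairs) and the function name encode are assumptions made for this illustration.

```python
def encode(cnf):
    """Translate a CNF formula into the lexical string used by the generator.

    cnf is a list of clauses; each clause is a list of (variable, negated) pairs.
    Negation becomes "-", conjunction becomes ",", disjunction is juxtaposition.
    """
    return ",".join(
        "".join(("-" if negated else "") + variable for variable, negated in clause)
        for clause in cnf
    )


# (not-x or y) & (not-y or z) & (x or y or z)
formula = [
    [("x", True), ("y", False)],
    [("y", True), ("z", False)],
    [("x", False), ("y", False), ("z", False)],
]
print(encode(formula))  # -xy,-yz,xyz
```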

Second, construct A (in polynomial time) in three parts. (A varies from formula to formula
only when the formulas involve different sets of variables.) The alphabet specification should list
the variables in σ together with the special characters T, F, minus sign, and comma. The equals
sign should be declared as the KIMMO wildcard character, as usual. The consistency automata,
one for each variable in σ, should be constructed as in Figure 8. The satisfaction automaton
should be copied from Figure 9 and does not vary from formula to formula. Figure 10 lists
the entire SAT generator system A for formulas that use variables x, y, and z.

The  generator  system  used  in  this  construction  is  set  up  so  that  surface  strings  are  identical 


ALPHABET  x y z T F - ,
ANY  =
END

"x-consistency"  3  3
      x  x  =
      T  F  =
  1:  2  3  1
  2:  2  0  2
  3:  0  3  3

"y-consistency"  3  3
      y  y  =
      T  F  =
  1:  2  3  1
  2:  2  0  2
  3:  0  3  3

"z-consistency"  3  3
      z  z  =
      T  F  =
  1:  2  3  1
  2:  2  0  2
  3:  0  3  3

"satisfaction"  3  4
      =  =  -  ,
      T  F  -  ,
  1.  2  1  3  0
  2:  2  2  2  1
  3.  1  2  0  0

END


Figure 10: This is the complete KIMMO generator system for solving SAT problems in the variables
x, y, and z. The system includes a consistency automaton for each variable in addition to a
satisfaction automaton that does not vary from problem to problem.


to lexical strings, but with truth values substituted for the variables. Thus any surface string
generated from σ will directly exhibit a satisfying truth-assignment for φ. The consistency
automaton for each variable x ensures that the value assigned to x is consistent throughout
the string. In state 1, no truth-value has been assigned and either x/T or x/F is acceptable.
In state 2, x/T has been chosen once and therefore only x/T can be permitted for other
occurrences of x. Similarly, state 3 allows only x/F. All of the states of the x-consistency
automaton ignore punctuation marks and variables other than x. The satisfaction automaton
blocks if any disjunction contains only F and -T after truth-values have been substituted for the
variables; thus the satisfaction automaton will end up in a final state only if the truth-values
that have been assigned satisfy every disjunction and hence φ.

The net result of the constraints imposed by the consistency and satisfaction automata
is that some surface string can be generated from σ just in case the original formula φ has
a satisfying truth-assignment. Furthermore, the pair ⟨A, σ⟩ can be constructed in time
polynomial in the length of φ; thus SAT is polynomial-time reduced to KIMMO GENERATION,
and the general case of KIMMO GENERATION is at least as hard as SAT. Figure 11 traces
the operation of the KIMMO generation algorithm on a satisfiable formula; note that the
generator goes through quite a bit of search even though there turns out to be only one answer.
Figure 12 shows what happens with an unsatisfiable formula.
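The whole reduction can be exercised with a small depth-first generator in Python. This is a simplified sketch, not the KIMMO generation algorithm itself: the function name generate, the fixed variable set, and the order in which T and F are tried are assumptions of the example, while the two transition tables are transcribed from the consistency and satisfaction automata described above.

```python
def generate(lexical, variables="xyz"):
    """Return every surface string that all automata accept for a lexical string."""
    consist = {1: (2, 3, 1), 2: (2, 0, 2), 3: (0, 3, 3)}
    satis = {1: {"T": 2, "F": 1, "-": 3, ",": 0},
             2: {"T": 2, "F": 2, "-": 2, ",": 1},
             3: {"T": 1, "F": 2, "-": 0, ",": 0}}
    results = []

    def step(states, lex, surf):
        """Advance every automaton over the pair lex/surf; None if any blocks."""
        new = {}
        for v in variables:
            column = 2 if lex != v else (0 if surf == "T" else 1)
            new[v] = consist[states[v]][column]
        new["sat"] = satis[states["sat"]][surf]
        return None if 0 in new.values() else new

    def search(i, surface, states):
        if i == len(lexical):
            if states["sat"] == 2:  # only state 2 of the satisfaction automaton is final
                results.append(surface)
            return
        lex = lexical[i]
        # A variable may surface as T or F; punctuation surfaces as itself.
        for surf in ("TF" if lex in variables else lex):
            nxt = step(states, lex, surf)
            if nxt is not None:
                search(i + 1, surface + surf, nxt)

    start = {v: 1 for v in variables}
    start["sat"] = 1
    search(0, "", start)
    return results


print(generate("-xy,-yz,-y-z,xyz"))  # the single satisfying assignment, as a surface string
print(generate("x,-x"))              # an unsatisfiable formula yields no surface string
```

Each automaton step is a constant-time table lookup, yet the search may still visit a number of partial surface strings that grows exponentially with the number of variables; that is the source of difficulty the reduction isolates.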


4.3.  KIMMO  Recognition  is  NP-Hard

Like the generator, the KIMMO recognizer can be used to solve computationally difficult
problems. KIMMO recognition and KIMMO generation are both NP-hard. To treat the
recognizer formally, define a possible instance of the computational problem KIMMO
RECOGNITION as any triple ⟨A, D, σ⟩ where A and σ are as before, and D is the dictionary
component of a KIMMO system described as specified in Gajek et al. (1983). An actual instance of
KIMMO RECOGNITION will be any possible instance ⟨A, D, σ⟩ such that for some σ′, (i) the
lexical-surface pair σ′/σ satisfies the constraints imposed by the automata in A as before,
and (ii) σ′ can be generated by the dictionary component D. Thus ⟨A, D, σ⟩ is an instance of




Generating  from  lexical  form  "-»y .-yr . -y-7.»y/' . 


1 

- 

1.1. 1.3 

38 

4 

-FF. -FT.-F-T, FFT 

3. 3. 2. 2 

2 

-F 

3. 1.1.2 

39 

•-FF. -FT.-F-T. FFT" 

•••  rtsult 

3 

-FF 

3. 3. 1.2 

40 

(3) 

-FT 

3. 2. 1,2 

4 

-FF, 

3. 3. 1.1 

41 

-FT. 

3.2. 1,1 

6 

-FF. 

- 

3. 3. 1.3 

42 

-FT.- 

3.2, 1.3 

6 

-FF, 

[Figure 11 trace table (continued): each row gives the step number, the backtracking entry, the
surface string generated so far, and either the automaton states reached or the automaton that
blocked. The trace yields the result ("-FF,-FT,-F-T,FFT").]
Figure 11: The KIMMO generator system of Figure 10 goes through these steps when applied to the
encoded version of the (satisfiable) formula (x̄ ∨ y)&(ȳ ∨ z)&(ȳ ∨ z̄)&(x ∨ y ∨ z). Though only one
truth-assignment will satisfy the formula, it takes quite a bit of backtracking to find it. The notation
used here for describing generator actions is similar to that used to describe recognizer actions in
Figure 2, but a surface rather than a lexical string is the goal. As in Figure 7, a *-entry in the
backtracking column indicates backtracking from an immediate failure in the preceding step, which
does not require the full backtracking mechanism to be invoked.


KIMMO RECOGNITION if σ is a recognizable word according to the constraints of A and D.

Many reductions are possible, but the reduction that will be sketched here uses the 3SAT
problem instead of SAT. It also uses an encoding for CNF formulas that is slightly different
from the one used in the generator reduction. To encode a SAT problem φ as a triple ⟨A, D, σ⟩,
first construct σ from φ by a new notational translation. This time, treat a variable x and
its negation x̄ as separate, atomic characters. Continue to use a comma for conjunction and
no explicit operator for disjunction, but now add a period at the end of the formula. Then
the σ corresponding to the formula (x ∨ x ∨ y)&(y ∨ y ∨ z)&(x ∨ y ∨ z) is xxy,yyz,xyz.,
a string of 12 characters. (With 3SAT, the commas are redundant, but they are retained here
in the interest of readability.)
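The notational translation just described can be sketched in a few lines of Python. This is only an illustration: the function name `encode_3sat`, the clause representation, and the choice of an upper-case letter as the atomic character for a negated variable are assumptions of the sketch, not part of the original construction.

```python
def encode_3sat(clauses):
    """Encode a 3CNF formula as the string sigma used in the recognizer reduction.

    Each clause is a list of three (variable, negated) pairs.
    Assumption of this sketch: the atomic character for a negated
    variable is written as the variable's upper-case letter.
    """
    encoded = []
    for clause in clauses:
        if len(clause) != 3:
            raise ValueError("every 3SAT clause needs exactly three literals")
        # No explicit operator for disjunction: literals are simply adjacent.
        encoded.append("".join(v.upper() if neg else v for v, neg in clause))
    # A comma stands for conjunction; a period closes the formula.
    return ",".join(encoded) + "."


# Hypothetical formula: (not-x or y or z) and (x or not-y or not-z)
example = [
    [("x", True), ("y", False), ("z", False)],
    [("x", False), ("y", True), ("z", True)],
]
sigma = encode_3sat(example)
print(sigma, len(sigma))
```

A formula with three clauses always yields 12 characters under this encoding: nine literals, two commas, and the final period.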

Second, construct A (in polynomial time) in two parts. (As before, A varies from formula
to formula only when the formulas involve different sets of variables.) The alphabet
specification should list the variables in σ together with their negations and the special characters
T, F, comma, and period. The equals sign should again be declared as the KIMMO wildcard
character. The consistency automata, still one for each variable in σ, should be constructed




[Figure 12 trace table: the 140 generator steps for the lexical form xyz,-x-z,-xz,-y-z,-yz,-zy,
each row giving the step number, the backtracking entry, the surface string generated so far, and
either the automaton states reached or the automaton that blocked. The final result is NIL.]


Figure 12: The KIMMO generator system of Figure 10 goes through 140 steps before verifying that the
formula (x ∨ y ∨ z)&(x̄ ∨ z̄)&(x̄ ∨ z)&(ȳ ∨ z̄)&(ȳ ∨ z)&(z̄ ∨ y) has no satisfying truth-assignment.


"x-consistency" 3 5

      T  F  T  F  =     (lexical characters)
      x  x  x̄  x̄  =     (surface characters)
  1:  2  3  3  2  1     (x undecided)
  2:  2  0  0  2  2     (x true)
  3:  0  3  3  0  3     (x false)


Figure 13: The KIMMO recognizer system that encodes a 3SAT formula φ should include a consistency
automaton of this form for every variable x that occurs in φ. As in the generator reduction, the
consistency automaton constrains the mapping from variables to truth-values, ensuring that the value
assigned to x is consistent throughout the formula. However, in the recognizer reduction the automaton
must also ensure that the values assigned to x and x̄ are opposites, since x and x̄ are treated as atomic
alphabet characters.
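The behavior of one such consistency automaton can be illustrated with a small table-driven simulation. This is a minimal sketch under stated assumptions: the function names, the dictionary layout, and the use of an upper-case letter for the negated-variable character are hypothetical, and the handling of the wildcard column is simplified to "any pair not otherwise listed".

```python
def consistency_automaton(var, neg):
    """Transition table for one variable, in the spirit of Figure 13.

    States: 1 = undecided, 2 = variable true, 3 = variable false; 0 = failure.
    Keys are (lexical, surface) pairs; "=" stands for the wildcard column.
    `neg` is the atomic surface character for the negated variable.
    """
    return {
        1: {("T", var): 2, ("F", var): 3, ("T", neg): 3, ("F", neg): 2, "=": 1},
        2: {("T", var): 2, ("F", var): 0, ("T", neg): 0, ("F", neg): 2, "=": 2},
        3: {("T", var): 0, ("F", var): 3, ("T", neg): 3, ("F", neg): 0, "=": 3},
    }


def accepts(table, lexical, surface):
    """Run the automaton over paired lexical/surface strings of equal length."""
    if len(lexical) != len(surface):
        return False
    state = 1
    for pair in zip(lexical, surface):
        row = table[state]
        state = row.get(pair, row["="])   # unlisted pairs fall under the wildcard
        if state == 0:
            return False
    return True                            # all three states are final


x_con = consistency_automaton("x", "X")    # "X" is this sketch's stand-in for negated x
print(accepts(x_con, "TF", "xX"))          # x true, negated x false
print(accepts(x_con, "TT", "xX"))          # x and its negation both true
```

The first call is accepted because the two occurrences agree on the value of x; the second blocks because x and its negation would both be true.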


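ALTERNATIONS
(Root = Root)
(Punct = Punct)
(# = )
END

LEXICON Root
TTT Punct "";
TTF Punct "";
TFT Punct "";
TFF Punct "";
FTT Punct "";
FTF Punct "";
FFT Punct "";

LEXICON Punct
, Root "";
. # "";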


Figure 14: The 3SAT recognizer system for any formula should include this dictionary component,
which ensures that the truth-values assigned to the variables in the surface string will cause the
formula to come out true. All combinations of three truth values are listed, except for the value FFF
that would cause one of the 3CNF disjunctions to be false; the same dictionary component is used for
all 3SAT problems. Each lexicon entry specifies the continuation class of lexicons that can follow. For
instance, the class Punct containing only the lexicon Punct is the continuation class of TTT, while the
class of . is the empty continuation class #. "" is an empty feature set, used since no word features
are being recovered in this mathematical reduction. The detailed format of the dictionary component
is described in Gajek et al. (1983).


as in Figure 13. There is no satisfaction automaton in this version of the recognizer.

Finally, take D as a constant from Figure 14. In this reduction, D imposes the satisfaction
constraint that was enforced with an automaton in the generator reduction. Formula φ will
be satisfied iff all of its conjuncts are satisfied, and since φ is in 3CNF, that means the
truth-values assigned within each disjunction must be TTT, TTF, ..., or any combination of three
truth-values except FFF. This is exactly the constraint imposed by the dictionary. (Note that
D is the same for every 3SAT problem; it does not grow with the size of the formula or the
number of variables.)

Compared to the generator reduction, the roles of the lexical and surface strings are
reversed in the recognizer reduction. The surface string encodes φ, while the lexical string
indicates truth-values for its variables. The consistency automaton for each variable x still
ensures that the value assigned to x is consistent throughout the formula, but now it also
ensures that x and x̄ are assigned opposite values. As before, the net result of the constraints
imposed by the various components is that ⟨A, D, σ⟩ is in KIMMO RECOGNITION just in
case φ has a satisfying truth-assignment. The general case of KIMMO RECOGNITION is at
least as hard as 3SAT, hence at least as hard as SAT or any other problem in NP (in the
sense of polynomial-time reduction).
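The combined effect of the consistency and dictionary constraints can be illustrated with a small backtracking search over lexical strings. This is a sketch under stated assumptions, not the KIMMO recognizer itself: the function name `recognize` is hypothetical, an upper-case letter stands for the atomic character of a negated variable, and the consistency automata and dictionary are replaced by equivalent direct checks.

```python
def recognize(sigma):
    """Search for a lexical string of truth-values matching the surface string sigma.

    sigma is an encoded 3CNF formula such as "Xyz,xYZ." (upper case = negated
    variable, an assumption of this sketch).  Returns a lexical string or None.
    """
    assignment = {}   # variable -> bool, playing the role of the consistency automata

    def search(i, lexical):
        if i == len(sigma):
            return "".join(lexical)
        ch = sigma[i]
        if ch in ",.":
            # Dictionary constraint: the clause just completed must not be FFF.
            if lexical[-3:] == ["F", "F", "F"]:
                return None
            lexical.append(ch)
            result = search(i + 1, lexical)
            lexical.pop()
            return result
        var = ch.lower()
        for tv in "TF":
            # A negated literal with value tv forces the opposite value on the variable.
            var_value = (tv == "T") != ch.isupper()
            if var in assignment and assignment[var] != var_value:
                continue                      # the consistency check blocks
            fresh = var not in assignment
            assignment[var] = var_value
            lexical.append(tv)
            result = search(i + 1, lexical)
            lexical.pop()
            if fresh:
                del assignment[var]
            if result is not None:
                return result
        return None

    return search(0, [])


print(recognize("Xyz,xYZ."))   # a satisfiable formula yields some lexical string
print(recognize("xxx,XXX."))   # an unsatisfiable formula yields None
```

As in the reduction, a lexical string is found exactly when the encoded formula has a satisfying truth-assignment, and the search may backtrack extensively before finding one.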


5.  The  Effect  of  Precompilation 


The reductions presented in section 4 require both the language description and the input
string to vary with the SAT/3SAT problem to be solved. Hence, there arises the question
of whether some computationally intensive form of precompilation could blunt the force of
the reduction, paying a potentially exponential compilation cost once and allowing KIMMO
runtime for a given grammar to be uniformly fast thereafter. This section examines four
aspects of the precompilation question.


5.1.  Conversion  to  GMACHINE/RMACHINE  Form 

The external description of a KIMMO automaton or lexicon is not the same as the form
that is used by the generation or recognition algorithm at runtime. Instead, the external
descriptions are used to construct internal forms: RMACHINE and GMACHINE forms for automata,
and letter trees for lexicons (Gajek et al., 1983). Hence one question to address is whether the
complexity implied by the reduction might actually apply to the construction of these internal
forms. If this were true, then the complexity of the generation problem (for instance) would
be concentrated in the construction of the “feasible-pair list” and the GMACHINE.

It is possible to deal with this question directly by reformulating the reduction so that the
formal problems and the construction specify machines in terms of their internal (e.g. GMACHINE)
forms instead of their external descriptions. The GMACHINEs for the class of machines
created in the construction have a very regular structure, and it is easy to build them directly
instead of building descriptions in external format. As Figure 11 also suggested, it is runtime
processing that makes translated SAT problems difficult for a KIMMO system to solve.

5.2.  BIGMACHINE  Precompilation 

There is also another kind of preprocessing that might be expected to help. As mentioned
in section 2.1.2, it is possible to compile a set of KIMMO automata into a single large
automaton that will run faster than the original set. The system will usually run faster with
one large automaton than with several small ones, since it has only one machine to step and
the speed of stepping a machine is largely independent of its size. However, in the worst case
the merged automaton is prohibitively large, exponentially larger than the smaller machines
(Karttunen, 1983:176).

Gajek et al. (1983) use the terms BIGGMACHINE and BIGRMACHINE to refer to the generation
and recognition versions of a large merged automaton, and therefore such an automaton
will be called a BIGMACHINE. Since it can take exponential time to build the BIGMACHINE
for a translated SAT problem, the reduction formally allows the possibility that BIGMACHINE
precompilation could make runtime processing uniformly efficient.

However, an expensive BIGMACHINE precompilation step doesn’t help runtime processing
enough to change the fundamental complexity of the algorithms. Recall from section 3.3 that
the main ingredients of KIMMO runtime complexity are the mechanical operation of the
automata, the difficulty of finding the correct lexical-surface correspondence, and the necessity
of choosing among alternative lexicons. BIGMACHINE precompilation will speed up the
mechanical operation of the automata, perhaps by a factor equal to the number of variables in
the SAT query. However, it will not help in the task of deciding which lexical/surface pair will
be globally acceptable. The BIGMACHINE will be as limited as the equivalent automata in its
forecasting abilities. Precompilation oils the machinery, but doesn’t accomplish fundamental
redesign.

5.3.  BIGMACHINE  Size  and  the  Interaction  of  Constraints 

BIGMACHINE precompilation sheds light on another precompilation question as well. It
is known that the compiled BIGMACHINE corresponding to a set of KIMMO automata can be
exponentially larger than the original system in the worst case; for example, such blowup
occurs if the SAT automata are compiled into a BIGMACHINE. In practice, however, the size
of the BIGMACHINE varies, thus naturally raising the question of what distinguishes the
“explosive” sets of automata from those that behave more tractably.

It is sometimes suggested that the degree of interaction among constraints determines
the amount of BIGMACHINE blowup. In this view, a large BIGMACHINE for a SAT problem is
no surprise, for the computational difficulty of SAT and similar problems results in part from
their “global” character. Their solutions generally cannot be deduced piece by piece from
local evidence; instead, the acceptability of each part of the solution may depend on the whole
problem. In the worst case, the solution is determined by a complex conspiracy among the
constraints of the problem. Thus the large BIGMACHINE gives a more “honest” estimate of
problem difficulty than the small collection of individual automata.

However, a slight change in the SAT automata demonstrates that BIGMACHINE size need
not correspond to the degree of interaction among the automata. Eliminate the satisfaction
automaton from the generator system, leaving only the consistency automata for the variables.
Then the system will not search for a satisfying truth-assignment, but merely for one that is
internally consistent, that is, one that never assigns both T and F to the same variable in its
different occurrences. This change will entirely eliminate the interactions among the automata;
each automaton is concerned only with the assignments to its particular variable, and there is no
way for an assignment to one variable to influence the acceptability of an assignment to another.

Yet despite the elimination of interactions, the BIGMACHINE must still be exponentially
larger than the collection of individual automata. Since the states of the BIGMACHINE must
distinguish all the possible truth-assignments to the variables, its size must be exponential in
the number of individual automata. In fact, the lack of interactions can actually increase the
number of states in the BIGMACHINE. Interactions among the automata constrain the
combinations of states that can be reached, thus reducing the number of accessible combinations
below the mathematical upper limit.
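The exponential state count can be checked on a small scale by exploring the merged state space directly. The sketch below assumes, for illustration only, that each consistency automaton has three states (undecided, true, false) and moves only on pairs involving its own variable; the function name and state labels are hypothetical.

```python
def reachable_bigmachine_states(variables):
    """Enumerate reachable state combinations of independent consistency automata.

    Assumption of this sketch: one three-state automaton per variable
    (undecided / true / false), with no satisfaction automaton present.
    """
    start = tuple("undecided" for _ in variables)
    seen = {start}
    frontier = [start]
    while frontier:
        state = frontier.pop()
        for i in range(len(variables)):
            for value in ("true", "false"):
                # Only automaton i moves on a pair for variable i; it blocks
                # if it has already committed to the opposite value.
                if state[i] not in ("undecided", value):
                    continue
                nxt = state[:i] + (value,) + state[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen


for n in range(1, 6):
    names = [f"v{k}" for k in range(n)]
    print(n, len(reachable_bigmachine_states(names)))
```

With no interactions to prune combinations, every one of the 3^n combinations is reachable, even though the n separate automata have only 3n states in total.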

5.4.  Transducers  and  Determinization 

One more precompilation question is whether the nondeterminism involved in constructing
the lexical-surface correspondence can't be removed by standard determinization techniques
for finite-state machines. After all, every nondeterministic finite-state machine has a
deterministic counterpart that is equivalent in the sense that it accepts the same language.¹⁵ Aren't
KIMMO automata just ordinary finite-state machines operating over an alphabet that happens
to consist of pairs of characters?

Figure 15: This nondeterministic finite-state transducer cannot be determinized. An equivalent
deterministic FST would have to wait for the end of the input string before generating any output.
However, at that point it would have to remember how many as or bs to output in correspondence
with the unbounded number of xs in the string, an impossible task for a finite-state device.

It is indeed possible to view KIMMO automata in this way when they are being used to
verify or reject hypothesized pairs of lexical and surface strings.¹⁶ However, in this use they
don’t need determinizing: they are already deterministic, for there is only one new state listed
in each cell of the description of a KIMMO automaton. In the cases of primary interest,
generation and recognition, the machines are being used as genuine transducers rather than
acceptors.

The determinizing algorithms that apply to finite-state acceptors will not work on
transducers. Indeed, many finite-state transducers are not determinizable at all. For example,
consider the transducer in Figure 15. On input xxxxxa it must output aaaaaa, while on input
xxxxxb it must output bbbbbb. An equivalent deterministic finite-state transducer is impossible.
A deterministic transducer could not know whether to output a or b upon seeing x. However,
it also could not output nothing and put off the decision until later: being finite-state, it would
not in general be able to remember at the end how many occurrences of x there had been, so
it would not be able to print the right number of initial occurrences of a or b.
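The behavior required of such a transducer can be illustrated by simulating a nondeterministic machine that follows every branch in parallel. The transition table below is an assumed encoding that is merely consistent with the input/output behavior described in the text; the state names and the table layout are hypothetical.

```python
# Assumed encoding: (state, input char) -> list of (output char, next state).
# On the first x the machine must guess whether the word will end in a or b.
TRANSITIONS = {
    ("start", "x"): [("a", "A"), ("b", "B")],
    ("A", "x"): [("a", "A")],
    ("A", "a"): [("a", "done")],
    ("B", "x"): [("b", "B")],
    ("B", "b"): [("b", "done")],
}


def transduce(word):
    """Return the sorted outputs of all accepting paths for `word`."""
    paths = [("start", "")]
    for ch in word:
        paths = [
            (nxt, out + o)
            for state, out in paths
            for o, nxt in TRANSITIONS.get((state, ch), [])
        ]
    return sorted({out for state, out in paths if state == "done"})


print(transduce("xxxxxa"))
print(transduce("xxxxxb"))
```

The simulation succeeds only because it keeps both guesses alive until the final character arrives; a deterministic finite-state device has no way to do this for an unbounded run of xs.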

For similar reasons, there is no way to build deterministic finite-state transducers for the
SAT problems. Upon seeing the first occurrence of a variable, a deterministic transducer could
not know in general whether it should output T or F. However, it also could not wait and output
a truth-value later, for there might be an unbounded number of occurrences of the variable
before there was sufficient evidence to assign the truth-value. A finite-state transducer would
not be able in general to remember how many truth-value outputs had been deferred.

¹⁵But not in the sense that it assigns the same parses to the strings of the language, where a parse according to
a finite-state machine is the sequence of states traversed, a point related to the impossibility of determinizing
transducers.

¹⁶This statement ignores any subtleties having to do with the processing of nulls, which will be discussed
later (§6).


6. The Effect of Nulls


Since KIMMO systems can encode NP-complete problems, the general KIMMO generation
and recognition problems are at least as hard as the computationally difficult problems in
NP. But could they be even harder? The answer depends on whether null characters are
allowed. If null characters are forbidden, the problems are in NP, hence (given the previous
NP-hardness result) NP-complete (§6.1). If null characters are completely unrestricted, the
problems are PSPACE-complete, thus potentially even harder than the problems in NP (§6.2).
However, the full power of unrestricted null characters is not needed for linguistically relevant
processing. Continuing to explore the effect of KIMMO null characters, section 6.3 mentions a
subtle point, with computational consequences, about the interpretation of the KIMMO
constraint-intersection operation when nulls are involved.


6.1. NP-Completeness Without Nulls

The generation and recognition problems for KIMMO automata without nulls are
NP-complete. Since section 4 showed that the problems were NP-hard, all that remains is to
show that a nondeterministic machine could solve them in polynomial time. Only a sketch of
the proofs will be given.

Given a possible instance ⟨A, σ⟩ of KIMMO GENERATION, the basic nondeterminism
of the machine can be used to guess the surface string corresponding to the lexical string σ.
The automata can then quickly verify the correspondence. The key fact is that if A allows no
nulls, the lexical and surface characters must be in one-to-one correspondence. The surface
string must be the same length as the lexical string, so the size of the guess can’t get out of
hand. (If the guess were too large, the machine would not run in polynomial time.)

Given a possible instance ⟨A, D, σ⟩ of KIMMO RECOGNITION, the machine should
guess the lexical string instead of the surface string; as before, its length will be manageable.¹⁷
Now, however, the machine must also guess a path through the dictionary. The number of
choice points is limited by the length of the string,¹⁸ while the number of choices at each point
is limited by the number of lexicons in the dictionary. Given a lexical-surface correspondence
and a lexicon path, the automata and the dictionary component can quickly verify that the
lexical/surface string pair satisfies all relevant constraints.

¹⁷When nulls are allowed as in the next section, the machine must also guess where to insert 0 characters into
the surface string. Because of the way the automata operate, the strings that are submitted to the automata
for verification must include the nulls.

¹⁸Nulls in the lexicon do not have the same interpretation as nulls in the automata. Nulls should not occur
in the dictionary, except in "null lexicon entries" that are written as 0 in their entirety. Unlike nulls in the
automaton component, which are treated as genuine characters by the automata, null lexicon entries are merely
a notational device and can be removed in the course of constructing letter trees from the lexicons. Thus the
number of choice points in the lexicon data-structure is limited by the length of the lexical string even when
nulls are permitted.


6.2.  PSPACE-Completeness  with  Unrestricted  Nulls 


If nulls are completely unrestricted, the arguments of section 6.1 do not go through. The
problem is that unrestricted null characters allow the lexical and surface strings to differ wildly
in length. The time it takes to guess or verify the lexical-surface correspondence may no longer
be polynomially bounded in the length of the input string.

In fact, it is easy to show that KIMMO RECOGNITION with unrestricted null characters
is PSPACE-complete, at least as hard as any problem that can be solved in polynomial space.
Though the question is open, PSPACE-complete problems are likely to be even harder than
NP-complete problems.

Not only is a PSPACE-complete problem not likely to be in P, it is also not likely to
be in NP. Hence a property whose existence question is PSPACE-complete probably
cannot even be verified in polynomial time using a polynomial length “guess.” (Garey
and Johnson, 1979:171).

Thus the worst case of KIMMO RECOGNITION becomes extremely difficult if null
characters are completely unrestricted. (Incidentally, PSPACE includes such problems as deciding
whether a player has a forced win from certain N x N checkers or Go configurations.¹⁹)

The easiest PSPACE-completeness reduction for KIMMO RECOGNITION with
unrestricted nulls involves the computational problem FINITE STATE AUTOMATA
INTERSECTION (Garey and Johnson, 1979:266). A possible instance of FSAI is a set of deterministic
finite-state automata over the same alphabet. The problem is to determine whether there is
any string that is accepted by all of the automata. Given a set of automata over alphabet
Σ, construct a corresponding KIMMO RECOGNITION problem as follows. Let a and b be
new characters not in Σ, and take the KIMMO alphabet to be Σ ∪ {a, b}.²⁰ Declare = as the
wildcard character and 0 as the null character.

Then build the rest of the automaton component in two parts. First, include the following
“main driver” automaton:

"Main Driver" 3 3

      a  b  =     (lexical characters)
      a  b  0     (surface characters)
  1.  2  0  0     (want a)
  2.  0  3  2     (let automata run)
  3:  0  0  0     (got ab; final state)


This will accept the surface string ab, allowing arbitrary lexical gyrations between a and b
as long as they come out null on the surface. Second, for each of the automata in the FSAI
problem, translate it directly into a KIMMO automaton by pairing the original characters from
Σ with surface nulls. Also add columns for a/a and b/b, with entries zero unless otherwise
specified. Bump all of the state numbers up by two. Let the new start state accept only a/a,
going to 3 (the old start state). Let only state 2 be a final state, but for every state that was
final in the original automaton, give it a transition to 2 on b/b.

¹⁹A few restrictions on the problems are necessary in order to show membership in PSPACE. For details,
see Garey and Johnson (1979:173, 250f) and references cited therein.

²⁰The reduction can also be done without a and b, but they are included because the resulting reduction is
more reminiscent of ordinary processing problems in which the question arises of how many nulls to hypothesize
between characters.

Third, let the root lexicon of the dictionary component contain a lexicon entry for each
single character in Σ ∪ {a, b}. The continuation class of each entry should send it back to the
root lexicon, except that the entry for b should list the word-final continuation class # instead.
Finally, take ab as the surface string for the KIMMO RECOGNITION problem. Surface a
will start up the translated versions of the original automata, which will be able to run freely
in between the a and the b because the characters in Σ all get paired with surface nulls. If
there is some string that all of the original automata accept, that lexical string will send all of
the translated automata into a state where the remaining b is acceptable. On the other hand,
if the original intersection is empty, the b will never become acceptable and the recognizer will
not accept the string ab.
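The automaton-translation step of this construction can be sketched as follows. This is an illustration under stated assumptions: the input automaton is taken to have states 1..n with start state 1, the table is stored as nested dictionaries keyed by (lexical, surface) pairs, and the names `translate_dfa` and `run` are hypothetical rather than part of any KIMMO implementation.

```python
def translate_dfa(alphabet, transitions, finals, n_states):
    """Translate one FSAI automaton into a KIMMO-style transition table.

    The DFA is assumed to have states 1..n_states with start state 1;
    `transitions` maps (state, char) to a state, and a missing entry means
    the DFA blocks.  Columns are (lexical, surface) pairs; entry 0 = failure.
    """
    columns = [(c, "0") for c in alphabet] + [("a", "a"), ("b", "b")]

    def blank_row():
        return {col: 0 for col in columns}

    table = {1: blank_row(), 2: blank_row()}
    table[1][("a", "a")] = 3                 # new start state: only a/a, to old start
    for s in range(1, n_states + 1):
        row = blank_row()
        for c in alphabet:
            if (s, c) in transitions:
                # Original characters are paired with surface nulls;
                # state numbers are bumped up by two.
                row[(c, "0")] = transitions[(s, c)] + 2
        if s in finals:
            row[("b", "b")] = 2              # old final states may finish on b/b
        table[s + 2] = row
    return table, {2}                        # state 2 is the only final state


def run(table, final_states, lexical, surface):
    """Step the translated automaton over paired lexical/surface characters."""
    state = 1
    for pair in zip(lexical, surface):
        state = table[state].get(pair, 0)
        if state == 0:
            return False
    return state in final_states


# Hypothetical DFA over {x, y} accepting strings that end in y.
ends_in_y = {(1, "x"): 1, (1, "y"): 2, (2, "x"): 1, (2, "y"): 2}
table, finals = translate_dfa("xy", ends_in_y, {2}, 2)
print(run(table, finals, "axyb", "a00b"))
print(run(table, finals, "axb", "a0b"))
```

The lexical string between a and b is accepted only when the original automaton accepts it, while the surface side shows nothing but nulls between the a and the b.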

This construction forms one half of the PSPACE-completeness proof, but it is also
necessary to show that KIMMO RECOGNITION is no harder than problems in PSPACE. It
is sufficient to transform arbitrary KIMMO RECOGNITION problems into FSAI problems.
Given a recognition problem, first convert the dictionary component into a large automaton
that (i) constrains the lexical string in the same way the dictionary component does, pairing
lexical characters with surface wildcards, but (ii) allows nulls to be inserted freely at the
lexical level, in case the other automata permit lexical nulls. The conversion can be performed
because the dictionary component is finite-state. Second, convert the input string into an
automaton as well. The input-string automaton should (i) constrain the surface string to be
exactly the input string, but (ii) allow surface nulls to be inserted freely. Third, expand out
all wildcard and subset characters in the automata, then interpret each lexical/surface pair
at the head of an automaton column as a single character in an extended alphabet. Given
this preparation, it is possible to solve the original recognition problem by solving FSAI for
the augmented set of automata. Since the input string is now encoded as an automaton,
the intersection of the languages accepted by all the automata consists of all the permissible
lexical-surface correspondences that reflect recognition of the input string. The intersection
will be nonempty, as FSAI tests, if and only if the input string is recognizable.
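The FSAI question itself can be illustrated with a direct search over combined states. The sketch below is a simple breadth-first exploration of the product of the automata, not a polynomial-space algorithm: the set of visited combinations can grow exponentially with the number of automata. The representation of each automaton as a (start, finals, transitions) triple and the name `common_word` are assumptions of the sketch.

```python
from collections import deque


def common_word(dfas, alphabet):
    """Return a shortest word accepted by every automaton, or None.

    Each automaton is a (start, finals, transitions) triple, where
    transitions maps (state, char) to a state; missing entries block.
    """
    start = tuple(d[0] for d in dfas)
    seen = {start}
    queue = deque([(start, "")])
    while queue:
        states, word = queue.popleft()
        if all(s in d[1] for s, d in zip(states, dfas)):
            return word
        for ch in alphabet:
            nxt = []
            for s, d in zip(states, dfas):
                target = d[2].get((s, ch))
                if target is None:
                    break                    # some automaton blocks on ch
                nxt.append(target)
            else:
                combo = tuple(nxt)
                if combo not in seen:
                    seen.add(combo)
                    queue.append((combo, word + ch))
    return None


# Hypothetical automata over {x, y}.
ends_in_y = (1, {2}, {(1, "x"): 1, (1, "y"): 2, (2, "x"): 1, (2, "y"): 2})
even_len = (0, {0}, {(0, "x"): 1, (0, "y"): 1, (1, "x"): 0, (1, "y"): 0})
ends_in_x = (1, {2}, {(1, "x"): 2, (1, "y"): 1, (2, "x"): 2, (2, "y"): 1})

print(common_word([ends_in_y, even_len], "xy"))
print(common_word([ends_in_y, ends_in_x], "xy"))
```

A returned word witnesses a nonempty intersection; `None` means that no string is accepted by all of the automata. Note that the empty word is a valid witness, so callers should compare the result against `None` rather than test its truthiness.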

The PSPACE-completeness proof shows that if null characters are completely unrestricted,
it can be very hard for the recogniser to reconstruct the superficially null characters that may
lexically intervene between two surface characters. However, unrestricted nulls surely are not
needed for linguistically relevant KIMMO systems. Processing complexity can be reduced by
any restriction that prevents the number of possible nulls between surface characters from
getting too large. As a crude approximation to a reasonable constraint, the above reduction
could be ruled out by forbidding entire lexicon entries to come out null on the surface.* A
suitable restriction would make the KIMMO generation and recognition problems only
NP-complete rather than PSPACE-complete.

* Recall from footnote 18 that an entry "0" in the dictionary is not the same as a dictionary entry that is
entirely deleted at the surface by the automata.




6.3.  The  Intersection  of  Constraints 


The null characters (0) that can appear in a KIMMO automaton allow the recogniser to
advance without consuming any characters from the input word. For example, in analysing the
word hoed as hoe+ed, the automata advance as if the surface string were ho00ed (see Karttunen
and Wittenburg, 1983:220), postulating surface nulls freely as required by the constraints of
the system. However, the interpretation of 0 as the empty string involves some subtlety when
multiple constraints are involved.

Internal to a KIMMO automaton, 0 is treated the same as any other character, but 0 is
effectively deleted at the interface to the surface string or the dictionary component. Abstractly
speaking, the treatment of nulls by the KIMMO recogniser involves two steps: (i) null characters
are inserted freely into the surface string to produce a form like ho00ed; (ii) this augmented
string is used to run the automata. Thus, a KIMMO automaton can be considered to define
both an internal constraint (relating the augmented strings with 0 characters inserted) and
an external constraint (relating the strings as they stood before 0-insertion).
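
The two-step view can be made concrete with a small Python sketch. The helper names insert_nulls and strip_nulls are hypothetical, and the sketch assumes that the character "0" is reserved for nulls and never occurs in an ordinary surface string.

```python
from itertools import combinations

def strip_nulls(internal):
    """Map an internal (augmented) string back to its external form."""
    return internal.replace("0", "")

def insert_nulls(surface, count):
    """Yield every way of inserting exactly `count` nulls into `surface`."""
    total = len(surface) + count
    for null_positions in combinations(range(total), count):
        chars = iter(surface)
        yield "".join(
            "0" if i in null_positions else next(chars) for i in range(total)
        )

candidates = set(insert_nulls("hoed", 2))
print("ho00ed" in candidates)                             # True
print(len(candidates))                                    # 15
print(all(strip_nulls(c) == "hoed" for c in candidates))  # True
```

Every candidate is a distinct internal string, yet all of them correspond to the same external string hoed; the automata are run on the internal strings.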

This distinction becomes important when there is more than one automaton in a KIMMO
system. The notion of "satisfying every constraint" could refer to intersecting either the
internal or the external versions of the constraints defined by the automata. If the external
languages are intersected, different automata can disagree about the placement of nulls. (This
corresponds to interpreting null characters as ordinary empty strings (epsilons, ε), since the
number of occurrences of the empty string between any two characters is indeterminate.) On
the other hand, if the internal forms of the constraints are intersected, all the automata must
agree on the number of nulls and their positions.

The actual KIMMO system performs internal intersection of the constraints defined by the
automata. Ron Kaplan** has pointed out that this subtle distinction in the interpretation of
KIMMO nulls has computational consequences. If the various constraints of a KIMMO system
were subject to external rather than internal intersection, thus interpreting KIMMO nulls as
ordinary epsilons, then BIGMACHINE precompilation would not be generally possible.

Since BIGMACHINE precompilation produces a single large finite-state transducer as output,
the intersection operation that it implicitly implements must always map finite-state
constraints into finite-state constraints. External intersection does not have this property, and
therefore BIGMACHINE precompilation would not be generally possible if external intersection
were used. Specifically, Kaplan has called attention to the following finite-state relations over
lexical-surface pairs:

        A = (a/b)*(0/c)*
    and B = (0/b)*(a/c)*

Each of these relations is easy to encode in a KIMMO automaton, but their external intersection

        A ∩ B = { a^n / b^n c^n }

cannot be defined by any KIMMO automaton, large or small, despite its finite-state origins.

** Kaplan's remarks were made in a talk presented to the Workshop on Finite-State Morphology, Center for
the Study of Language and Information, Stanford University, July 29-30, 1985.


This example makes crucial use of the fact that external intersection allows different
automata to disagree about the placement of nulls; under internal intersection (e.g. in the
current KIMMO system) no nontrivial lexical-surface pair satisfies both of the constraints. For
instance, A will reject the external string pair aa/bbcc except as aa00/bbcc, while B will
reject it except as 00aa/bbcc. Since internal intersection requires all automata to agree about
the placement of nulls, aa/bbcc will be rejected under internal intersection.
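
Kaplan's example can be checked by brute force on short strings. The following Python sketch enumerates all internal pair sequences up to a small length bound, then compares internal and external intersection. The encoding of each lexical/surface pair as a three-character token, the bound MAX_LEN, and all function names are assumptions made for this illustration.

```python
import re
from itertools import product

# The four lexical/surface pairs used by the two relations.
PAIRS = ["a/b", "0/c", "0/b", "a/c"]
A = re.compile(r"(?:a/b)*(?:0/c)*")
B = re.compile(r"(?:0/b)*(?:a/c)*")

def internal_language(pattern, max_len):
    """All pair sequences of length <= max_len accepted by the pattern."""
    found = set()
    for n in range(max_len + 1):
        for seq in product(PAIRS, repeat=n):
            if pattern.fullmatch("".join(seq)):
                found.add(seq)
    return found

def external(seq):
    """Delete nulls from both levels, giving a (lexical, surface) pair."""
    lexical = "".join(pair[0] for pair in seq).replace("0", "")
    surface = "".join(pair[2] for pair in seq).replace("0", "")
    return lexical, surface

MAX_LEN = 6
internal_a = internal_language(A, MAX_LEN)
internal_b = internal_language(B, MAX_LEN)

# Internal intersection: the automata must agree on every pair, nulls included.
internal_both = internal_a & internal_b
# External intersection: nulls are deleted before the relations are compared.
external_both = {external(s) for s in internal_a} & {external(s) for s in internal_b}

print(internal_both)
print(sorted(external_both))
```

Within this bound, the internal intersection contains only the empty sequence, while the external intersection contains the pairs a^n / b^n c^n for n = 0 to 3, matching the pattern described in the text.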

The computational consequences of the distinction between internal and external
intersection become more severe when KIMMO systems are generalised slightly. For example, if
KIMMO automata are generalised to use three levels instead of two, and if certain other small
changes are made, then the recognition problem becomes computationally undecidable under
external intersection (Barton, 1985b).




7.  Improving  KIMMO  Dictionary  Efficiency

One final matter remains. Despite the fact that navigation through the lexicons of the
dictionary component can account for quite a bit of backtracking in the current KIMMO system,
the previous sections gave little attention to that problem. Instead, section 3.3.2 promised that
the dictionary component could be changed in such a way that most of the choice points would
be eliminated. This section explains how.

7.1.  Subdivisions  of  the  Dictionary 

Naturally, there would be no need to choose among alternative lexicons if the dictionary
were not subdivided. In the existing KIMMO system, subdivisions are needed for two reasons.
First, the continuation-class mechanism is the only means for expressing co-occurrence
restrictions among roots and affixes, and a continuation class is a set of lexicons. Second, incorrect
dictionary search paths can be recognised and pruned more quickly when suffixes are stored
separately from roots.

The existing continuation-class mechanism makes the lexicon the finest unit of
discrimination between suffixes. If a, x, y are dictionary entries such that the sequence ax is possible
but ay is not, this constraint will be impossible to capture unless x and y are listed in separate
lexicons; if they are in the same lexicon, it will be impossible for the continuation class of
a to include x but not y. Thus the need to express co-occurrence restrictions leads to the
use of multiple lexicons. For example, Karttunen and Wittenburg (1983:224) must list -ed
and -er in separate lexicons because of such contrasts as doer/*doed. In the special case
of separated dependencies, the weakness of the current continuation-class mechanism leads
to a large amount of duplicated structure in the multiple lexicons that must be constructed
(Karttunen, 1983:180).
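
A toy Python sketch may help to show why the doer/*doed contrast forces -ed and -er into separate lexicons. The lexicon names, the continuation-class names, and the two-root vocabulary are invented for the example and do not reproduce the Karttunen and Wittenburg dictionary; spelling-change rules are ignored, and a word is allowed to end after any entry.

```python
# Hypothetical lexicons: each entry pairs a form with a continuation class.
LEXICONS = {
    "Root": [("do", "IrregularVerb"), ("walk", "RegularVerb")],
    "Agentive": [("er", "End")],
    "Past": [("ed", "End")],
}

# A continuation class is a set of lexicons that may supply the next entry.
CONTINUATION_CLASSES = {
    "IrregularVerb": ["Agentive"],
    "RegularVerb": ["Agentive", "Past"],
    "End": [],
}

def accepts(word, lexicons=("Root",), may_end=False):
    """Segment `word` into entries licensed by the continuation classes."""
    if word == "":
        return may_end
    for lexicon in lexicons:
        for form, continuation in LEXICONS[lexicon]:
            if word.startswith(form):
                rest = word[len(form):]
                if accepts(rest, CONTINUATION_CLASSES[continuation], True):
                    return True
    return False

for word in ["doer", "doed", "walker", "walked"]:
    print(word, accepts(word))
```

If "er" and "ed" were listed in a single suffix lexicon, any continuation class admitting that lexicon would admit both suffixes, and doed could no longer be excluded.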

Small  lexicons  are  also  advantageous  for  pruning  search,  since  it  can  become  apparent 
very  early  that  no  acceptable  suffix  starts  out  with  the  letters  at  hand.  For  instance,  if  none  of 
the  suffixes  that  can  attach  to  the  current  word  start  with  a,  it  is  pointless  to  search  beyond 
an  a  in  the  input  (ignoring  spelling-change  rules  here).  If  the  legal  suffixes  for  the  current 
class  of  word  are  stored  in  a  separate  lexicon,  the  letter-tree  version  of  the  lexicon  will  not 
be  searched  beyond  an  a.  However,  if  they  are  listed  with  many  other  suffixes  such  as  -able,
the  search  will  not  be  aborted  until  later  —  possibly  not  until  the  end  of  a  suffix,  when  the 
combinatory  features  of  the  suffix  can  be  checked. 

Unfortunately, multiple lexicons slow analysis down quite a bit in the current version
of KIMMO. Each of the lexicons in a continuation class is searched separately. The first few
characters beyond a lexicon choice point tend to get reanalysed several times, with that portion
of the lexical-surface correspondence worked out afresh each time. If x, y above are stems (N,
V, etc.) instead of suffixes (that is, if a is a prefix), then the root lexicon becomes subdivided.
In such a situation, the separate searching of the different portions of the root lexicon becomes
especially serious. Much storage is also wasted (Karttunen and Wittenburg, 1983:221f).

In some cases, however, the current finite-state lexicon structure cannot capture the proper
co-occurrence restrictions even if duplication and inefficiency can be tolerated. Prefixes
generally apply only to words of particular classes, thus making it necessary to have separate
lexicons for the various classes of words involved. But since prefixes and suffixes can
productively form new words of various classes (for instance, -ize forms verbs), it may not be
possible for a lexicon to list them all. Formally speaking, if both prefixes and suffixes (i) are
fully productive, (ii) can change the categories of words arbitrarily, and (iii) can attach to only
particular categories of words, then separated dependencies can arise that exceed the power of
a finite-state lexicon structure. In such cases, context-free rules of some kind might be better
suited to the hierarchical word-structures that are involved. Alternatively, it might be
preferable to subdivide the problem by enforcing only crude finite-state combinatorial constraints
while figuring out the lexical-surface correspondence, then filtering the analyses in a more
sophisticated way afterward.

7.2.  Merging  the  Lexicons 

The number of separate lexicon searches can obviously be reduced if there is only one
lexicon. Roots and affixes can all be listed together, with the combinatory possibilities of
various elements indicated by a feature system. Such a feature system can be used whether or
not the existing finite-state dictionary framework is replaced with something more powerful.

Within the existing framework, each lexicon name can be interpreted as a feature; the
continuation class of each entry is then taken to specify the possible lexicon features of its
immediate successor in the word. Alternatively, a more powerful framework might be modelled
after the linguistic framework of Lieber (1980). Context-free machinery of some kind could
implement the recovery of hierarchical structure, the application of Lieber's feature-percolation
conventions, and the enforcement of combinatory restrictions. Common grammar-processing
techniques could be used to predict at each boundary the set of permissible combinatorial
features (the continuation class) of the next segment of input.

As  noted,  however,  merging  the  lexicons  in  this  way  has  the  disadvantage  that  it  prolongs 
some  dictionary  searches  that  would  have  failed  early  with  more  finely-divided  lexicons.  At 
modest  cost  in  time  and  space,  this  disadvantage  can  be  eliminated  by  adding  bit  vectors  to 
the  internal  letter-tree  form  of  the  lexicon.  The  bit  vector  associated  with  a  link  in  the  letter 
tree  indicates  which  classes  of  words  or  affixes  can  be  found  in  the  subtree  below.  Bit  vectors
should  also  be  associated  with  the  outputs  of  the  tree.

The bit-vector scheme makes it possible to search in parallel through all of the lexicons in
a continuation class. The implementation will no longer interpret a continuation class in terms
of the individual letter-trees of several lexicons; instead, a continuation class will correspond
to an encoded set of lexicon names for use in descending the single merged letter-tree. Before
descending a branch (or using an output), it is necessary to check whether there is a non-null
intersection between the lexicons comprising the desired continuation class and the lexicons
accessible down the branch. On many computers, this test can be carried out in a single
instruction, if the number of lexicons in the dictionary is small (e.g. < 32). Search should
terminate if the intersection is null. With the "virtual" split lexicons provided by the bit-vector
scheme, a failing search can terminate just as early in the lexical string as it will with lexicons
that have individual letter-trees; Figure 16 shows an idealised illustration. In an actual system,
the dictionary would have more finely divided lexicons than N and V, especially for suffixes.
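
The bit-vector test can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the traces in this paper: the Node class, the functions build and lookup, the two-lexicon bit assignment, and the three sample entries are all hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # letter -> (bit vector of lexicons below, child)
        self.outputs = 0    # bit vector of lexicons with an entry ending here

def build(entries):
    """Merge (form, lexicon_bit) entries into one letter tree."""
    root = Node()
    for form, bit in entries:
        node = root
        for letter in form:
            mask, child = node.children.get(letter, (0, Node()))
            node.children[letter] = (mask | bit, child)
            node = child
        node.outputs |= bit
    return root

def lookup(root, text, wanted):
    """Descend the merged tree along `text`, pruning with bit vectors.

    `wanted` encodes the continuation class as a bit vector. Returns the
    entries found as prefixes of `text` and the number of links traversed.
    """
    found, steps = [], 0
    node = root
    for i, letter in enumerate(text):
        mask, child = node.children.get(letter, (0, None))
        if mask & wanted == 0:   # null intersection: abandon the branch
            break
        steps += 1
        node = child
        if node.outputs & wanted:
            found.append((text[: i + 1], node.outputs & wanted))
    return found, steps

N, V = 1, 2  # one bit per lexicon
ROOT = build([("kill", V), ("kin", N), ("kit", N)])

print(lookup(ROOT, "kill", N))      # ([], 2)
print(lookup(ROOT, "kill", V))      # ([('kill', 2)], 4)
print(lookup(ROOT, "kill", N | V))  # ([('kill', 2)], 4)
```

When only a noun is acceptable, the search stops after the links for k and i, exactly where a separate noun tree containing kin and kit would fail; without the masks, the merged tree would be followed to the end of kill before the category mismatch was noticed.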

An implementation of this dictionary scheme was used to generate the traces shown in
Figure 3 and succeeding figures. Without the merged dictionary, the recogniser for English
locates a suffix in the continuation class /V by doing a separate letter-tree descent for each of
the lexicons P3, PS, PP, PR, I, AG, and AB. With the merged dictionary, the recogniser needs
only one letter-tree descent in the virtual lexicon (/V) = {P3,PS,PP,PR,I,AG,AB}, thus reducing
the number of steps needed to analyse an input. Finely divided lexicons (hence continuation
classes with several members) are typically necessary for capturing co-occurrence restrictions
even in approximate form, and consequently the merged dictionary almost always speeds up
recogniser operation. Finally, even though it takes extra space to augment links and outputs
with bit-vectors, the merged dictionary can also save space by sharing structure among what
would otherwise be separate letter trees.


[Figure 16: letter-tree diagrams]

Figure 16: If separate letter trees for nouns and verbs are merged as on the left, failing searches may
be prolonged unnecessarily. Assuming that no nouns are accessible down the kil... branch of the
merged tree, it is useless to traverse that branch if only a noun is acceptable in the current context.
However, the fruitlessness of the branch may not be apparent until the end of an entry (e.g. kill)
is reached and category features are available. In the letter tree on the right, each link has been
augmented with a bit-vector that indicates the classes of entries that are accessible down the link.
The bit-vectors enable the system to terminate a failing search without going any further down the
tree than it would with unmerged lexicons. In this case, the kil... subtree would not be searched
because the intersection of {V} and {N} is null.


8.  References 


Barton,  E.  (1985a).  “On  the  Complexity  of  ID/LP  Parsing,”  A.I.  Memo  No.  812,  M.I.T. 
Artificial  Intelligence  Laboratory,  Cambridge,  Mass. 

Barton,  E.  (1985b).  “Intractability  in  Finite-State  Machinery,”  A.I.  Memo  No.  878,  M.I.T. 
Artificial  Intelligence  Laboratory,  Cambridge,  Mass.  (Forthcoming;  tentative  title.) 

Berwick, R., and A. Weinberg (1982). “Parsing Efficiency, Computational Complexity, and
the Evaluation of Grammatical Theories,” Linguistic Inquiry 13.2:165-191.

Clements, G., and E. Sezer (1982). “Vowel and Consonant Disharmony in Turkish,” in
van der Hulst and Smith (1982b:213-256).

Gajek, O., H. Beck, D. Elder, and G. Whittemore (1983). “LISP Implementation [of the
Kimmo system],” Texas Linguistic Forum 22:187-202.

Garey, M., and D. Johnson (1979). Computers and Intractability. San Francisco: W. H. Freeman
and Co.

Hale, K. (1982). “Some Essential Features of Warlpiri Verbal Clauses,” in Swartz (1982:217-
315).

Karttunen, L. (1983). “KIMMO: A Two-Level Morphological Analyser,” Texas Linguistic
Forum 22:165-186.

Karttunen, L., and K. Wittenburg (1983). “A Two-Level Morphological Analysis of English,”
Texas Linguistic Forum 22:217-228.

Kay, M., and R. Kaplan (1982). “Word Recognition,” unpublished draft ms. dated May 1982,
Xerox Palo Alto Research Center, Palo Alto, California.

Lieber, R. (1980). On the Organization of the Lexicon. Ph.D. thesis, Department of Linguistics
and Philosophy, M.I.T., Cambridge, Mass.

Lindstedt, J. (1984). “A Two-Level Description of Old Church Slavonic Morphology,” Scando-
Slavica 30:165-189.

McCarthy, J. J. (1982). “Prosodic Templates, Morphemic Templates, and Morphemic Tiers,”
in van der Hulst and Smith (1982a:191-223).

Nash, D. (1980). Topics in Warlpiri Grammar. Ph.D. thesis, Department of Linguistics and
Philosophy, M.I.T., Cambridge, Mass.

Poser, W. (1982). “Phonological Representation and Action-At-A-Distance,” in van der Hulst
and Smith (1982b:121-158).

Swartz, S., ed. (1982). Papers in Warlpiri Grammar in Memory of Lothar Jagst. Work-Papers
of SIL-AAB, Series A, Volume 6, Summer Institute of Linguistics, Berrimah, N.T.

Underhill, R. (1976). Turkish Grammar. Cambridge, Mass.: M.I.T. Press.

van der Hulst, H., and N. Smith, eds. (1982a). The Structure of Phonological Representations,
Part I. Dordrecht, Holland: Foris Publications.

van der Hulst, H., and N. Smith, eds. (1982b). The Structure of Phonological Representations,
Part II. Dordrecht, Holland: Foris Publications.




